Q.1. What is big data? Explain.

Ans. "Data of a very large size, typically to the extent that its
manipulation and management present significant logistical challenges" is known
as big data.

Big data refers to the technologies and initiatives that involve data that is
too diverse, fast-changing or massive for conventional technologies, skills and
infrastructure to address efficiently. The term describes data sets that are so
large and complex that they are impractical to manage with traditional software
tools.

The information in big data can now be analyzed by using new technologies,
e.g., retailers can track user web clicks to identify behavioural trends that
improve campaigns, pricing and stocking.

Major web companies such as Google, Amazon, and Facebook pioneered
businesses built on monetizing massive data volumes over the last decade.
They invented new paradigms not only for extracting value from data but also
for managing data and compute resources, from data center design to hardware,
software and application provisioning.

Another definition of big data is as follows -

"The collection, processing, discovery, analysis and storage of large
volumes and disparate types of data, very quickly and cost effectively, is
enabled by the emerging technologies and practices".


Q.2. What is the importance of big data ? 


Ans. The importance of big data depends upon its utilization. Data
can be fetched from any source and analyzed to find answers that enable us in
terms of -
(i) Cost reductions
(ii) Time reductions
(iii) New product development and optimized offerings
(iv) Smart decision making.

Combination of big data with high-powered analytics can have a great
impact on business strategy in the following ways -
(i) Finding the root cause of failures, issues and defects in real-time
operations.
(ii) Generating coupons at the point of sale on seeing the customer's
habit of buying goods.
(iii) Recalculating entire risk portfolios in just minutes.
(iv) Detecting fraudulent behaviour before it affects and risks our
organization.

Q.3. Write short note on drivers for big data.

Ans. There are three contributing factors or drivers for big data. These
drivers are consumers, automation and monetization.

More than each of these contributing factors, their interaction is speeding
the creation of big data. With increasing automation, it is easier to offer big
data creation and consumption opportunities to the consumers, and the
monetization process is increasingly providing an efficient marketplace for big
data. These drivers are explained below -

(i) Sophisticated Consumers - The increase in information level and
the associated tools has created a new breed of sophisticated consumers. These
consumers are far more analytic, far savvier at using statistics, and far more
connected, using social media to rapidly collect and collate opinion from others.

(ii) Automation - Marketing and sales have received their biggest
boost in instrumentation from Internet-driven automation over the past
years. Browsing, shopping, ordering, and customer service on the web
has provided tremendous control to the user but has also created a
flood of information for the marketing, product and sales organizations in
understanding the buyer's behavior. Each sequence of web clicks can be
collected, collated and analyzed for customer delight, puzzlement, dysphoria,
or outright defection. More information can also be obtained about the sequence
leading up to a decision.

(iii) Monetization - From a big data analytics perspective, a "data bazaar"
is the biggest enabler to create an external marketplace where we collect,
exchange, and sell customer information. We are seeing a new trend in the
marketplace, in which customer experience is packaged and sold to other
industries.


Q.4. Discuss the four V's and five V's characteristics of big data.
Ans. Big data is important because it enables organizations to gather, store,
manage, and manipulate vast amounts of data at the right speed, at the right
time, to gain the right insights. In addition, big data generators must create
scalable data (volume) of different types (variety) under controllable generation
rates (velocity), while maintaining important characteristics of the raw data
(veracity) and the added value (value) that the collected data can bring to the
intended process, activity or predictive analysis/hypothesis. Indeed, there is no
clear definition for 'Big Data'. It has been defined based on some of its
characteristics. Therefore, these five characteristics, earlier known as the
4V's (volume, variety, velocity and veracity), have been used to define Big
Data, as illustrated in fig. 1.1.


Fig. 1.1 Five V's Big Data Characteristics


(i) Volume - It refers to the quantity of data gathered by a company.
This data must be used further to gain important knowledge. Enterprises are
awash with ever-growing data of all types, easily amassing terabytes, even
petabytes, of information (e.g., turning 12 terabytes of tweets per day into
improved product sentiment analysis, or converting 350 billion annual meter
readings to better predict power consumption). Moreover, Demchenko, Grosso,
de Laat and Membrey stated that volume is the most important and distinctive
feature of Big Data, imposing specific requirements on the technologies and
tools currently used.

(ii) Velocity - It refers to the time in which Big Data can be
processed. Some activities are very important and need immediate responses,
which is why fast processing maximizes efficiency. For time-sensitive processes
like fraud detection, big data flows must be analyzed and used as they stream
into the organization in order to maximize the value of the information (e.g.,
scrutinize 5 million trade events created each day to identify potential fraud,
or analyze 500 million daily call detail records in real time to predict
customer churn faster).

(iii) Variety - It refers to the type of data that big data can comprise.
This data may be structured or unstructured. Big data consists of different
types of data, including structured and unstructured data such as text, sensor
data, audio, video, click streams, log files and so on. The analysis of combined
data types brings new problems, situations, and so on, such as monitoring
hundreds of live video feeds from surveillance cameras to target points of
interest, or exploiting the 80% data growth in images, video and documents to
improve customer satisfaction.

(iv) Value - It refers to the important feature of the data which is
defined by the added value that the collected data can bring to the intended
process, activity or predictive analysis/hypothesis. Data value will depend on
the events or processes they represent, such as stochastic, probabilistic,
regular or random. Depending on this, requirements may be imposed to collect all
data, store it for a longer period (for some possible event of interest), etc.
In this respect data value is closely related to the data volume and variety.

(v) Veracity - It refers to the degree to which a leader trusts information
in order to make a decision. Therefore, finding the right correlations in Big
Data is very important for the future of the business. However, as one in three
business leaders do not trust the information used to reach decisions,
generating trust in big data presents a huge challenge as the number and type
of sources grows.


Q.5. Explain big data types with examples.
Or
Write short note on structured and unstructured data.
(R.G.P.V., May 2019 (VIII-Sem.))

Ans. Big data encompasses everything, from dollar transactions to tweets
to images to audio. Therefore, taking advantage of big data requires that all
this information be integrated for analysis and data management. This is
more difficult than it appears. Big data includes huge volume, high velocity
and extensible variety of data. There are three types of data concerned -

(i) Structured Data - This is the data stored in relational databases and
tables in the format of rows and columns. They have fixed structures and these
structures are defined by organizations by creating a model. The model allows
the data to be stored and processed and gives permission to operate on the
data. The model defines the characteristics of data, including data type and
some restrictions on the data. Analysis and storing of structured data is very
easy. Because of high cost, limited storage space and the techniques used for
processing, RDBMS was the only path to store and process the data effectively.
A programming language called Structured Query Language (SQL) is used for
managing this type of data.

(ii) Semi-structured Data - Data that has some of the form of structured
data but does not fit the data model is semi-structured data. This form of data
increased rapidly after the introduction of the World Wide Web, where various
forms of data need a medium for interchanging information, like XML and JSON.

Example - CSV, XML and JSON documents are semi-structured
documents. NoSQL databases are considered as semi-structured.

(iii) Unstructured Data - Data without any specific structure, which
due to this cannot be stored in a row and column format, is unstructured
data. This data is the opposite of structured data. It cannot be stored
in a databank. The volume of this data is growing extremely fast, which makes it
very tough to manage and analyze completely. To analyze unstructured data,
advanced technology knowledge is needed.

Fig. 1.2 depicts these types of big data along with examples.

[Figure: Big data types, computer or machine generated and human generated.
Structured data (e.g., input data, click-stream data, gaming related data,
financial data); semi-structured data (e.g., CSV, XML, JSON); unstructured data
(e.g., satellite images, scientific data, photographs, text internal to company,
social media data, mobile data, web content).]

Fig. 1.2 Big Data Types
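
The three data types described above can be contrasted in a few lines of Python. This is only an illustrative sketch: the table name `sales`, the JSON records and the review text are made-up sample data, not taken from the text, and only the standard `sqlite3` and `json` modules are used.

```python
import json
import sqlite3

# Structured data: rows and columns under a fixed model, managed with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (item, amount) VALUES (?, ?)",
    [("pen", 1.5), ("book", 12.0)],
)
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]

# Semi-structured data: self-describing keys, but records may differ in shape.
doc = json.loads('[{"user": "a", "tags": ["x", "y"]}, {"user": "b", "age": 30}]')
keys_per_record = [sorted(record.keys()) for record in doc]

# Unstructured data: free text with no data model; any structure must be derived.
review = "Great pen, terrible book. Would buy the pen again."
pen_mentions = review.lower().count("pen")

print(total, keys_per_record, pen_mentions)
```

The SQL query works because every row has the same columns; the JSON records have to be inspected one by one, and the free text gives up information only through ad hoc processing such as counting a keyword.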
Q.6. Give advantages of big data over traditional data.

Ans. In big data, a large problem is divided into several small sets that are
solved by different computers present in a network, and these computers
communicate with each other. The cost of a distributed database is lower and it
provides better computing. A distributed database is based on microprocessors,
which is economical as compared to a centralized database, which is based on
mainframes, and a distributed database has more computational power than a
traditional one. Traditional database systems are based on structured data,
whereas big data uses semi-structured as well as unstructured data. A
traditional database stores a small amount of data, ranging from some gigabytes
to terabytes; however, big data can store and analyze data ranging from
hundreds of terabytes to petabytes and more. Storing large amounts of data
reduces the cost, which helps business intelligence (BI). Big data uses a
dynamic schema, whereas a traditional database uses a fixed schema, which
cannot be changed once saved. A traditional database system requires complex
and expensive software and hardware for managing large amounts of data, while
in big data the large data is divided across several systems, so the amount of
data in each system is reduced. This makes the use of big data simple and cheap.
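
As a rough illustration of the fixed versus dynamic schema point, the sketch below inserts a record with an extra field into a relational table and into a schema-free list of records. The `customer` table, the field names and the sample values are hypothetical, and an in-memory SQLite table stands in for a traditional database.

```python
import sqlite3

# Fixed schema: the columns are declared before any data is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, name TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'Asha')")
try:
    # 'city' was never declared, so the table rejects this row.
    conn.execute(
        "INSERT INTO customer (id, name, city) VALUES (2, 'Ravi', 'Bhopal')"
    )
    fixed_schema_accepted = True
except sqlite3.OperationalError:
    fixed_schema_accepted = False

# Dynamic schema: each record carries its own fields, so a new field
# can appear without redefining anything first.
records = [{"id": 1, "name": "Asha"}]
records.append({"id": 2, "name": "Ravi", "city": "Bhopal"})
fields_seen = sorted({key for record in records for key in record})

print(fixed_schema_accepted, fields_seen)
```

The flexibility has a price: with a dynamic schema, code that reads the data has to cope with records that lack a field.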


Q.7. Compare traditional data and big data.

Ans. The comparison of traditional data and big data is given in table 1.1.


Table 1.1 Comparison of Traditional Data and Big Data

Factor | Traditional Data | Big Data | Advantage of Big Data
Architecture | Centralized database | Distributed database | Cost effective
Data type | Structured | Unstructured and semi-structured | -
Data volume | Small amount of data, range: gigabytes to terabytes | Large amount of data, range: petabytes and more | Helps business intelligence
Data schema | Fixed schema | Dynamic schema | Preserves the information in data
Data relationship | Relationship with data is explored easily | Difficulty in relationship between data items | -
Scaling | Single server for computing | More than one server for computing | Cost effective
Accuracy | Less accurate results | High accurate results | Confident and reliable results


Q.8. Describe history of big data.

Ans. Big data is not a new phenomenon but part of a long evolution of capturing
and using data. Like other developments in data storage, data processing and
the internet, big data is the next step that will bring change in the way
we run society. The ancient history of data begins about c. 18,000 BCE, when
humans used tally sticks for storing and analyzing data. Tribal peoples
used to mark notches into bones or sticks for calculations, which let them
predict how long their food would last. One of the earliest prehistoric
data storage devices is the Ishango Bone, discovered in 1960 in what is now
Uganda. Then, in c. 2400 BCE, came the abacus, the very first device built
particularly for performing calculations. Our first libraries also appeared in
this time period, which represented our initial step towards mass storage. Then,
in the period 300 BC-48 AD, the library containing the largest collection of
data of the ancient world, which covered pretty much everything we had learned
so far, was accidentally destroyed by the Romans. The earliest mechanical
computer was developed by the Greeks in c. 100-200 AD; its "CPU" consisted of
30 bronze gears. It was designed for astrological purposes and for tracking the
cycle of the Olympic Games. After this, many small discoveries, like the first
recorded experiment in statistical data analysis, laid the foundation for the
emergence of statistics. In 1880, the Hollerith Tabulating Machine was
developed. Designed by Herman Hollerith, known as the father of automated
computation, it used punch cards for calculation and completed 10 years of work
in 3 months. Then started the early stage of modern data storage. In 1928 the
German-Austrian engineer Fritz Pfleumer invented magnetic tape, which stored
information magnetically. Then came Business Intelligence and the start of
large data centers, where the ideas of the relational database and Material
Requirements Planning systems were put forward.


In 1989 the first use of the term big data was made by Erik Larson in
Harper's Magazine, where he said that "The keepers of big data say they are
doing it for the consumer's benefit. But data have a way of being used for
purposes other than originally intended". The birth of the World Wide Web, which
kicked the internet into gear, took place in 1991. The Google search engine
started in the year 1997. A couple of years later, in 1999, the term big data
appeared in a research paper published by the Association for Computing
Machinery. This research highlighted large amounts of data, inadequate space
for storage and analysis difficulties. In 2001, the characteristics of big
data, i.e., volume, velocity and variety, were defined by Doug Laney. In 2005,
the open source framework called Hadoop was created for storing and analyzing
large data sets. Hadoop was famous for its flexibility and its management of
both structured and unstructured data.

Life on earth evolved around 4 billion years ago, of which human evolution
occurred over the last 6 million years. About 100000 years ago human language
evolution started, about 70000 years ago cognitive evolution started, and
finally the scientific evolution happened about 500 years ago. Analytics
evolution is just about 30 years old and is still unfolding. It started in the
1990s as Analytics 1.0, also known as 'Business Intelligence'. In this era,
data about production processes, customer interactions, etc. were collected,
combined and analyzed in traditional databases, where data used to fit neatly
into rows and columns. In the Analytics 1.0 era, IT and business analysts spent
more time on preparing data for analysis than on analytics itself. Then in the
2000s came Analytics 2.0, better known as big data, which had complex queries
over both structured and unstructured data. To deal with such fast processing
across parallel servers, software like Hadoop, NoSQL, etc. has been developed.


Q.9. Explain in detail about challenges in big data.


Ans. There are six major challenge areas in big data; those areas are
shown in fig. 1.3.


Fig. 1.3 Major Challenges in Big Data


(i) Timeliness - Timeliness is one of the challenges of big data. As the
volume of the data increases, the time taken to analyze it will also
increase. Some cases, like analyzing fraudulent activity, require quick
processing. But it is not possible to get a full analysis of the data
before the fraudulent transactions can occur. So, the greater the volume and
variety of data, the more time is needed for a complete analysis of big data.
Companies like banks and security companies therefore face timeliness
challenges in analyzing big data.


(ii) Security - Big data is concerned with a large number of use cases
such as staging, pre-processing, processing, metadata storage, and storing
both transient and long-term fact data. Serving every use case requires a
multi-layered infrastructure. Security and privacy are the two major concerns
of IT. However, security and privacy become a question mark as the volume of
big data grows rapidly. From the security perspective, the available
cryptography standards cannot meet the demands of big data. Consequently,
privacy protection remains another difficult issue in big data.


(iii) Storage and Transport - Big data stores and manages data in
different ways from traditional data warehouses. It encompasses massive sensor
data, raw and semi-structured log data of IT industries, and the exploded
quantity of data from social media. In addition, data is being created by
everyone and everything (i.e., from mobile devices to supercomputers, and by
professionals, scientists, journalists, researchers, etc.).


(iv) Accessibility - The rapid pace of growth of data on the web
challenges researchers to develop efficient algorithms and processing
techniques. The approach to accessing big data has two aspects. First, the
data is processed at the source side and only the results are transmitted. For
this, enhancement of scripting technologies on the client side is required to
fetch the necessary code from the server. Second, only the relevant data is
transmitted after applying proper filters.


(v) Inconsistencies - The scope of this review, i.e., big data, spans
multi-dimensional, technical and precise domains. The goal of big data also
varies from stakeholder to stakeholder. Big data analytics is an emerging
frontier for the growth and advancement of technologies. Therefore, its
impact on society should not be ignored. It appears evident that sooner or
later big data will extend to cover all these areas and sectors, like social
sciences, life and physical sciences, communication, finance and so forth. Big
data includes every domain; hence inconsistency exists at the data level,
information level, or knowledge level. This inconsistency at every level must
be addressed. Inconsistency has been divided into four categories, i.e.,
temporal, textual, spatial and functional inconsistency.

(vi) Mobility - Organizations are ready to expand mobile applications,
which are being widely accepted on cell phones. The combination of mobile
computing technologies and enterprise development has great potential to
increase productivity in business. With escalating location-based datasets and
the influx of data from mobile applications, their size and variety surpass the
capacity of spatial and mobile computing technologies. The online behaviour of
mobile users has contributed a great deal to data growth. The integration of
customer tracking services (including GPS and spatial data) into the big data
model could face a couple of significant issues: firstly, the growth in
computational cost, since this extension increases the demand on mobile
devices; secondly, it uses geographical reasoning to draw conclusions over time
and space. The inherent growth in mobile phone use generates a colossal amount
of data from every customer.

Q.10. Discuss the issues in big data.


Ans. There are a few key issues in big data that an organization should
clearly understand in order to adopt the technology competently. These data
issues are discussed below -


(i) Issues Related to the Characteristics -


(a) Data Volume - As the data size increases, the value of
different data records diminishes in proportion to age, type, richness, and
quantity, among other factors. The social networking sites available are
themselves producing data in terabytes again and again, and this amount of
data is obviously difficult to handle with the existing traditional systems.

(b) Data Velocity - The traditional systems are not sufficiently
capable of performing analytics on data which is constantly changing or
growing. The rise of e-commerce has rapidly increased the speed and richness
of data.

(c) Data Variety - The data comes in various forms, for example, raw,
structured, semi-structured, and unstructured data, and these different
formats are difficult to handle with the existing traditional analytic
systems. From the analytic perspective, the failure of traditional analytic
systems is a main obstacle to effectively using the immense volume of data. In
addition, mismatched data formats, non-aligned data structures, and
inconsistent data semantics represent significant challenges that can lead to
analytic sprawl.

(d) Data Value - The stored data is utilized by business
organizations for data analytics, which has created a sort of gap between
business leaders and IT professionals, as business leaders want to increase
the value of their business and gain more and more profit, unlike the IT
leaders, who are concerned with the details of storage and processing.

(ii) Storage and Transport Issues - The quantity of data has exploded
every time we have invented a new storage medium. What is unique about the
most recent data explosion, primarily due to web-based social networking, is
that no new storage medium has been introduced. Even if an exabyte of data
could be processed on a single computer system, it would be unable to directly
attach the required number of disks. Access to that data would overwhelm
current communication networks. Assuming that a 1 gigabit per second network
has an effective sustainable transfer rate of 80%, the sustainable bandwidth is
around 100 megabytes per second.

(iii) Data Management Issues - Data management is the most complex issue
when dealing with big data. It involves resolving issues of access,
utilization, updating, governance, and reference. The sources of the data are
varied by size, by format, and by method of collection.

Main data management issues are -
(a) Data privacy (b) Security
(c) Ethical (d) Governance.

(iv) Processing Issues - Assume that an exabyte of data needs to be
processed in its entirety. For simplicity, assume the data is chunked into
blocks of 8 words, so 1 exabyte = 1K petabytes. Assuming a processor expends
100 instructions on one block at 5 gigahertz, the time required for end-to-end
processing of a block would be 20 nanoseconds. To process 1K petabytes would
require a total end-to-end processing time of around 232000 days. Thus,
effective processing of exabytes of data will require extensive parallel
processing and new analytics algorithms in order to provide timely and
meaningful results.


(v) Processing Major Issues -

(a) Gathering the required data/information

(b) Aligning data from different sources (e.g., resolving when two
entities are the same)

(c) Transforming the data into a form suitable for analysis

(d) Modelling it, whether mathematically or through some form
of simulation

(e) Understanding the output, and visualizing and distributing the
results; think for a second how to display multifaceted analytics on an iPhone
or a mobile device.
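
The transport and processing figures quoted above can be checked with a short back-of-envelope calculation. The sketch below rests on stated assumptions: the network speed is read as 1 gigabit per second (the reading under which 80% efficiency gives about 100 megabytes per second), 1 exabyte is taken as 10^18 bytes, the processor is assumed to retire one instruction per cycle, and the 8-byte word size is a guess, since the text does not give one.

```python
EXABYTE = 10**18  # bytes, decimal convention (1 exabyte = 1000 petabytes)

# Transport: a 1 gigabit/s link used at 80% effective rate.
link_bits_per_s = 10**9
effective_bytes_per_s = link_bits_per_s * 0.8 / 8  # bits -> bytes
transfer_hours = EXABYTE / effective_bytes_per_s / 3600

# Processing: 100 instructions per block at 5 GHz,
# assuming one instruction per clock cycle.
seconds_per_block = 100 / 5e9  # 20 nanoseconds


def processing_days(bytes_per_block):
    """Days needed to process one exabyte serially, block by block."""
    blocks = EXABYTE / bytes_per_block
    return blocks * seconds_per_block / 86400


# One 20 ns step per byte: lands close to the quoted 232000 days.
days_one_byte_blocks = processing_days(1)
# Blocks of 8 words of 8 bytes each (word size is an assumption).
days_64_byte_blocks = processing_days(64)

print(f"effective rate : {effective_bytes_per_s / 1e6:.0f} MB/s")
print(f"transfer time  : {transfer_hours:,.0f} hours")
print(f"1-byte blocks  : {days_one_byte_blocks:,.0f} days")
print(f"64-byte blocks : {days_64_byte_blocks:,.0f} days")
```

Under these assumptions the quoted figure of about 232000 days is reproduced only when one 20-nanosecond step is counted per byte; with 64-byte blocks the serial time is much shorter, but it is still measured in years. Either way the conclusion in the text holds: an exabyte cannot be moved or processed serially in a useful time, so parallel processing is needed.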
Q.11. Discuss the security challenges of big data.

Ans. The exponential growth of data has opened many opportunities
for researchers, students, and industries as well as cyber criminals. Cyber
criminals can destroy both the industry and its customers with a single
opportunity. The effect of weak security can lead to the destruction of the
industry's reputation and may cost millions of dollars as a data
breach settlement. So, the industries collecting large volumes of data should be
aware of security challenges while storing and processing big data. Some
of the security challenges are explained below -


(i) Privacy - Privacy is considered one of the primary security
challenges of big data. These days lots of companies trade customers' sensitive
information based on the user's location and preferences. Hackers can easily
track the identity of the users by analyzing the location and pattern of online
activity. Once they are able to track the personal information, they can use
that information to create duplicate debit and credit cards to be sold online.
When data is transferred from one company to another, there should be
some guarantee that the company that receives the data will use it fairly.
Sometimes even the fair use of data results in personal harm.


One retailer was tracking the shopping habits of customers and
concluded that a teenage customer was pregnant. The retailer started sending
deals and coupons related to pregnancy products to her mailing address,
thinking they might be useful to that customer. These deals and coupons
unintentionally disclosed the pregnancy to her father. Even though the
retailer used an accurate result from the analysis, it violated the privacy of
the teenage customer. Future data scientists should be aware of the
consequences of using information from the analysis of big data.

(ii) Quantity of Loss Affected from the Security Breaches - Another
potential security challenge of big data is the amount of loss resulting from
a security breach. The more data is generated, the more serious the
consequences of compromised data will be, compared with normal data.

Another security challenge of big data is maintaining access control
of the big data. There will be lots of users trying to access the data.
Being large in size, big data needs more work to classify the importance of
data and decide whom to give access to it.


Most companies collect data from various sources and keep it in a central
storing site. The data contains very sensitive information such as personal
information of people, employee information, financial information, etc.
Sensitive information might be potentially vulnerable to attack, as attackers
can easily attack the central database.


An additional security challenge of big data is maintaining compliance
with the law. Some laws restrict where data should be stored and
processed. Since a company has to store and transfer data over the Internet,
it is potentially difficult for companies to comply with the law. They might
face security issues, or they might be violating the privacy of individuals
without noticing it.


TECHNOLOGIES AVAILABLE FOR BIG DATA,
INFRASTRUCTURE FOR BIG DATA, USE OF DATA
ANALYTICS, DESIRED PROPERTIES OF BIG DATA SYSTEM

Q.12. Explain big data technologies. 

Ans. Using modern computing technology, businesses may now manage
immense volumes of data that previously could be dealt with only by using
expensive supercomputers. These are now much cheaper. As a result, new
techniques for distributed computing are mainstream. Big data became paramount
as companies such as Yahoo!, Google, and Facebook came to the realization that
they required help in monetizing the massive amounts of data their offerings
were creating. Thus, these new companies had to search for new technologies
to store, access, and analyze huge amounts of data in near real time. Such
real-time analysis is required in order to profit from so much data from users.
Their resulting solutions have affected the larger data management market. In
particular, the innovations MapReduce, Hadoop, and Big Table have led to a new
generation of data management. These technologies allow businesses to address
one of the most fundamental problems, namely the capability to process massive
amounts of data efficiently, cost-effectively, and quickly.


(i) MapReduce — MapReduce was designed by Google to 
efficiently carry out a set of functions against a large amount of data in 
batch mode. The “map” component distributes the programming problem 
or task across a large number of systems while managing placement to 
balance the load and allow recovery from failures. After the distributed 
computation is complete, another function called “reduce” aggregates all the 
elements back together to provide a result. An example of MapReduce would 
be determining the number of pages in a book that are written in each of 50 
different languages. 


(ii) Big Table - Big Table was developed by Google to be a
distributed storage system to manage highly scalable structured data. Data
is organized into tables with rows and columns. Unlike typical relational
database models, Big Table is a sparse, distributed, persistent
multidimensional sorted map. It has been designed to keep large volumes of data
across commodity servers.


(iii) Hadoop - Hadoop is an Apache-managed software framework
derived from MapReduce and Big Table. Hadoop allows applications based on
MapReduce to run on large clusters of commodity hardware. The project has
become the basis for the computing architecture underlying Yahoo!'s business.
Hadoop is designed to parallelize data processing across computing nodes to
speed computations and diminish latency. Two major components of Hadoop
exist - a massively scalable distributed file system that can support petabytes
of data, and a massively scalable MapReduce engine that computes results in
batches.
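
The map and reduce steps described above can be sketched in plain Python using the text's own example of counting the pages of a book written in each language. This is a single-process illustration only, not Google's or Hadoop's implementation: the function names, the `chunk_size` parameter and the sample `book` are hypothetical, and a real framework would run the map step on many machines and handle placement and failure recovery.

```python
from collections import defaultdict


def map_page(page):
    """Map step: emit one (language, 1) pair for a page."""
    return (page["language"], 1)


def shuffle(pairs):
    """Group the emitted values by key, as a framework does between the steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(language, ones):
    """Reduce step: aggregate all values emitted for one language."""
    return language, sum(ones)


def count_pages_by_language(pages, chunk_size=2):
    # Split the input into chunks; each chunk could go to a different worker.
    chunks = [pages[i:i + chunk_size] for i in range(0, len(pages), chunk_size)]
    mapped = [map_page(page) for chunk in chunks for page in chunk]
    grouped = shuffle(mapped)
    return dict(reduce_counts(lang, ones) for lang, ones in grouped.items())


book = [
    {"page": 1, "language": "en"},
    {"page": 2, "language": "hi"},
    {"page": 3, "language": "en"},
    {"page": 4, "language": "fr"},
    {"page": 5, "language": "hi"},
    {"page": 6, "language": "en"},
]
counts = count_pages_by_language(book)
print(counts)
```

Because each page is mapped independently and each language is reduced independently, both steps can be spread over many machines, which is what makes the model suitable for batch processing of large data.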


Q.13. Describe big data tools and techniques. 


Ans. Organizations use various techniques and technologies to aggregate, manipulate, analyze and visualize big data. They come from various fields such as statistics, computer science, applied mathematics, and economics. Some of them have been developed intentionally and some of them have been adapted for this purpose. 

To capture the value from big data, we need to develop new techniques and technologies for analyzing it. Until now, scientists have developed a wide variety of techniques and technologies to capture, curate, analyze and visualize big data. 


Big Data Techniques — Big data needs extraordinary techniques to efficiently process large volumes of data within limited run times. Reasonably, big data techniques involve a number of disciplines, including statistics, data mining, machine learning, neural networks, social network analysis, signal processing, pattern recognition, optimization methods and visualization approaches. There are many specific techniques in these disciplines, and they overlap with each other. 


(i) Optimization Methods — Optimization methods have been applied to solve quantitative problems in a lot of fields, such as physics, biology, engineering, and economics. 


(ii) Statistics — Statistics is the science of collecting, organizing and interpreting data. Statistical techniques are used to find correlations and causal relationships in data. 


Fig. 1.4 Big Data Techniques 

(iii) Data Mining — Data mining is a set of techniques to extract 
valuable information (patterns) from data, including clustering analysis, 
classification, regression and association rule learning. 

(iv) Machine Learning — Machine learning is an important 
application of artificial intelligence which is aimed to design algorithms that 
allow computers to evolve behaviours based on empirical data. The most 
obvious characteristic of machine learning is to discover knowledge and make 
intelligent decisions automatically. 

(v) Artificial Neural Network — Artificial Neural Network (ANN) 
is a mature technique and has a wide range of application coverage. Its 
successful applications can be found in pattern recognition, image analysis, 
adaptive control, and other areas. 

(vi) Visualization Approaches — Visualization approaches are the 
techniques used to create tables, images, diagrams and other intuitive ways 
of displaying data so that it can be understood. 

(vii) Social Network Analysis — Social network analysis (SNA), 
which has emerged as a key technique in modern sociology, views social 
relationships in terms of network theory, in which a network consists of nodes and ties. 
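The nodes and ties of social network analysis can be made concrete with a small sketch. It assumes an undirected network and uses degree centrality, one common SNA measure, chosen here only as an example; the people and ties are hypothetical.

```python
from collections import defaultdict


def build_network(ties):
    """Return an adjacency map: node -> set of nodes it is directly tied to."""
    network = defaultdict(set)
    for a, b in ties:
        network[a].add(b)
        network[b].add(a)
    return network


def degree_centrality(network):
    """Fraction of the other nodes that each node is directly tied to."""
    others = len(network) - 1
    if others <= 0:
        return {node: 0.0 for node in network}
    return {node: len(neighbours) / others
            for node, neighbours in network.items()}


# Hypothetical friendship ties between four people.
ties = [("Asha", "Ben"), ("Asha", "Chen"), ("Asha", "Dev"), ("Ben", "Chen")]
centrality = degree_centrality(build_network(ties))
most_connected = max(centrality, key=centrality.get)
print(most_connected, centrality[most_connected])
```

A node with a high score is tied to most of the network, which is one simple way of spotting influential members.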

Higher level big data technologies include distributed file systems, distributed computational systems, massively parallel-processing (MPP) systems, data mining based on grid computing, cloud-based storage and computing resources, as well as granular computing and biological computing. 

Big Data Tools — Current big data tools concentrate on three classes, namely, batch processing tools, stream processing tools, and interactive analysis tools. 

(i) Big Data Tools Based on Batch Processing — One of the most famous and powerful batch process based big data tools is Apache Hadoop. It provides infrastructures and platforms for other specific big data applications. 

Table 1.2 Big Data Tools Based on Batch Processing 

Specified Use | Advantages 
Infrastructure and platform | High scalability, reliability, completeness. 
Infrastructure and platform | High performance distributed execution engine, good programmability. 
Machine learning algorithms in business | Good maturity. 
Business intelligence software | Cost-effective, self-service BI at scale. 
Business analytics platform | Robustness, scalability, flexibility in knowledge discovery. 
Machine learning and advanced analytics | Process massive datasets accurately at high speeds. 
Data visualization, business analytics | Faster, smart, fit, beautiful and easy to use dashboards. 
Big data workspace | Collaborative and standards-based unconstrained analytics and self service. 
Data management and application integration | Easy-to-use, eclipse-based graphical environment. 

(ii) Stream Processing Big Data Tools — Hadoop does well in processing large amounts of data in parallel. It provides a general partitioning mechanism to distribute aggregate workload across different machines. Nevertheless, Hadoop is designed for batch processing. It is a multi-purpose engine but not a real-time and high performance engine, since there is high throughput latency in its implementations. Stream big data has high volume, high velocity and complex data types. Indeed, when the high velocity and time dimension are concerned in applications that involve real-time processing, there are a number of different challenges to the map/reduce framework. Therefore, the real-time big data platforms, like SQLstream, Storm and StreamCloud, are designed especially for real-time stream data analytics. 

Table 1.3 Big Data Tools Based on Stream Processing 

Specified Use | Advantages 
Real-time computation system | Scalable, fault-tolerant, and is easy to set up and operate. 
Processing continuous unbounded streams of data | Proven, distributed, scalable, fault-tolerant, pluggable platform. 
Sensor, M2M, and telematics applications | SQL-based, real-time streaming big data platform. 
Collect and harness machine data | Fast and easy to use, dynamic environments, scales from laptop to datacenter. 
Apache Kafka: distributed publish-subscribe messaging system | High-throughput stream of activity data. 
SAP Hana: platform for real-time business | Fast in-memory computing and real-time analytics. 

(iii) Big Data Tools Based on Interactive Analysis — The interactive analysis presents the data in an interactive environment, allowing users to undertake their own analysis of information. Users are directly connected to the computer and hence can interact with it in real time. The data can be reviewed, compared and analyzed in tabular or graphic format or both at the same time. 

(a) In 2010, Google proposed an interactive analysis system, named Dremel, which is scalable for processing nested data. Dremel has a very different architecture compared with the well-known Apache Hadoop, and acts as a successful complement of map/reduce based computations. It has the capability to run aggregation queries over trillion-row tables in seconds by means of combining multi-level execution trees and columnar data layout. 

(b) Apache Drill is another distributed system for interactive analysis of big data. It is similar to Google's Dremel. For Drill, there is more flexibility to support a variety of query languages, data formats and data sources. 
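The difference between batch and stream processing can be sketched in a few lines. In the example below each reading is handled as soon as it arrives and only a small window is kept in memory, instead of storing the whole data set first. The sensor feed, the function names and the choice of a sliding-window average are assumptions made for illustration, and `window_size` is assumed to be at least 1.

```python
from collections import deque


def windowed_average(stream, window_size):
    """Yield the mean of the most recent `window_size` readings after each arrival."""
    window = deque(maxlen=window_size)
    total = 0.0
    for reading in stream:
        if len(window) == window.maxlen:
            total -= window[0]  # the oldest reading is about to be evicted
        window.append(reading)
        total += reading
        yield total / len(window)


def sensor_readings():
    """Hypothetical sensor feed; a real stream would be unbounded."""
    yield from [10.0, 20.0, 30.0, 40.0]


averages = list(windowed_average(sensor_readings(), window_size=2))
print(averages)
```

A result is available after every reading, which is the property that real-time platforms provide and that a batch job over stored data does not.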
Q.14. Write short note on processing of big data. 


Ans. For big data processing, a large number of big data technologies have been developed, and these are categorized according to different concepts. In processing, a lot of content must be extracted and analyzed from the collected information to serve the knowledge requirements of business organizations, political parties and scientific research departments. 

The process is initiated with the retrieval of information from various sources such as databases, websites, documents or a content management system. Hadoop is used for storing this massive amount of data. 

Before big data is processed, it must be recorded from the various information-creating sources. It must be filtered and optimized in the order in which it is produced. Only the pertinent information ought to be recorded, by means of filters that discard useless data. This can be done by specific instruments such as ETL. The ETL method commonly combines data from multiple systems, after which the data is actually loaded into the data warehouse. 


Q.15. Describe the architecture for big data. 


Ans. The architecture for big data is shown in fig. 1.5. 


Big Data Tech Stack 

Big Data Applications 
Reporting and Visualization 
Analytics (Traditional and Advanced) 
Analytical Data Warehouses and Data Marts 
"Organizing" Databases and Tools 
Operational Databases (Structured, Unstructured, Semi-structured) 
Interfaces and Feeds from/to the Internet 
Redundant Physical Infrastructure 

Fig. 1.5 Big Data Architecture 


(i) Interfaces and Feeds — What makes big data big is the fact that it relies on picking up lots of data from lots of sources. Therefore, application programming interfaces (APIs) are a core part of any big data architecture. In addition, interfaces exist at every level and between every layer of the stack. Without integration services, big data cannot happen. 

Operational database approaches include columnar databases, which store information efficiently in columns, and not rows. This approach leads to faster performance. When geographic data is part of the equation, a spatial database is optimized to store and query data based on how objects are related in real terms. 


(ii) Redundant Physical Infrastructure — The supporting physical infrastructure is fundamental to the operation and scalability of a big data architecture. In fact, without the availability of robust physical infrastructures, big data would likely not have become such a strong trend. To support an unanticipated or unpredictable volume of data, a physical infrastructure for big data has to be different from that for traditional data. The physical infrastructure has been based on a distributed computing model. This means that data may be physically stored in many different locations, allowing it to be linked through networks, the use of a distributed file system, and various big data analytic tools and applications. 


Redundancy is important, as companies must handle a great deal of data from many sources. Redundancy comes in many forms. For instance, if the company has created a private cloud, it may want to create redundancy within private areas so that it can scale out to support changing workloads. If a company needs to limit internal IT growth, it may use external cloud services to add to its own resources. In some cases, this redundancy may come in the form of a Software as a Service (SaaS), allowing companies to carry out advanced data analysis as a service. The SaaS approach allows for a faster start at reduced costs. 


(iii) Security Infrastructure — As big data analysis becomes part of workflow, it becomes vital to secure that data. For example, a healthcare company probably wants to use big data applications to determine changes in demographics or shifts in patient needs. This data about patients needs to be protected, both to meet compliance requirements and to protect patient privacy. The company needs to consider who is allowed to see the data and when they may see it. Also, the company needs to be able to verify the identity of users, as well as protect the identity of patients. These types of security requirements must be part of the big data fabric from the outset, and not an afterthought. 

(iv) Operational Data Sources — Concerning big data, a company must ensure that all sources of data will provide a better viewpoint about the business and allow it to understand how data affects the operational methods of that company. Traditionally, an operational data source consisted of highly structured data, managed by the line of business in a relational database. However, operational data now has to consider a broader set of data sources, including unstructured sources like social media or customer data. 


(v) Performance Matters — Data architecture also must work to perform according to the supporting infrastructure of the organization or company. For instance, the company might be interested in running models to determine whether it is safe to drill for oil in an offshore area, provided with real-time data of temperature, salinity, sediment resuspension, and many other biological, chemical, and physical properties of the water column. It might take days to run this model using a traditional server configuration. However, using a distributed computing model, a day's long task may take minutes. 

Performance might also determine the kind of database that the company would use. Under certain circumstances, stakeholders may want to understand how two very distinct data elements are related, or the relationship between social network activity and growth in sales. This is not the typical query the company could ask of a structured, relational database. A graph database might be a better choice, as it may be tailored to separate the "nodes" or entities from its "properties" or the information that defines that entity, and the "edge" or relationship between nodes and properties. Using the right database may also improve performance. Typically, a graph database may be used in scientific and technical applications. 

(vi) Organizing Data Services and Tools — Indeed, not all the data that organizations use is operational. A growing amount of data comes from a number of sources that are not quite as organized or straightforward, including data that comes from machines or sensors, and massive public and private data sources. In the past, most companies were not able to either capture or store this vast amount of data. It was simply too expensive or too overwhelming. Even if companies were able to capture the data, they did not have the tools to do anything about it. Very few tools could make sense of these vast amounts of data. The tools that did exist were complex to use and did not produce results within a reasonable time frame. In the end, companies who really wanted to make the enormous effort of analyzing this data were forced to work with snapshots of data. This means that stakeholders may miss out on relevant events as they may not have been captured in a certain snapshot. 

(vii) Analytical Data Warehouses and Data Marts — After a company sorts through the massive amounts of data available, it is often pragmatic to take the subset of data that reveals patterns and put it into a form that is available to the business. Such so-called 'warehouses' provide compression, multilevel partitioning, and a massively parallel processing architecture. 

(viii) Reporting and Visualization — Companies have always relied on the capability to create reports to give them an understanding of what the data tells them about everything from monthly sales figures to projections of growth. Big data changes the way the data is managed and used. If a company can collect, manage, and analyze enough data, it may use a new generation of tools to help management truly understand the impact not just of a collection of data elements, but also how these data elements offer context based on the business problem being addressed. With big data, reporting and data visualization have become tools for looking at the context of how data is related and the impact of those relationships on the future. 

(ix) Big Data Applications — Traditionally, business has anticipated that data would be used to answer questions about what to do and when to do it. Data has often been integrated into general-purpose business applications. With the advent of big data, this is changing. Now, the companies are seeing the development of applications that are designed specifically to take advantage of the unique characteristics of big data. Specific emerging applications include areas like healthcare, manufacturing management and traffic management. All of these applications rely on huge volumes, velocities and varieties of data. In healthcare, a big data application might be able to monitor premature infants to determine if data indicates when intervention is needed. In manufacturing, a big data application can be used to prevent a machine from shutting down during a production run. A big data traffic management application may reduce the number of traffic jams on busy city highways, decreasing the number of accidents while saving fuel and reducing pollution. 


Q.16. What do you mean by big data analytics ? Explain various types of analytics. 


Ans. Big data analytics is the process of examining large data sets containing a variety of data types, i.e., big data, to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. The analytical findings can lead to more effective marketing, new revenue opportunities, better customer service, improved operational efficiency, competitive advantages over rival organizations and other business benefits. 

The primary goal of big data analytics is to help companies make more informed business decisions by enabling data scientists, predictive modelers and other analytics professionals to analyse large volumes of transaction data, as well as other forms of data that may be untapped by conventional business intelligence (BI) programs. That could include web server logs and Internet click stream data, social media content and social network activity reports, text from customer e-mails and survey responses, mobile phone call detail records and machine data captured by sensors connected to the Internet of Things. 

Big data burst upon the scene in the first decade of the 21st century, and the first organizations to embrace it were online and start-up firms. Arguably, firms like Google, LinkedIn, eBay and Facebook were built around big data 
from the beginning. They did not have to reconcile or integrate big data with more traditional sources of data and the analytics performed upon them, because they did not have those traditional forms. They did not have to merge big data technologies with their traditional IT infrastructures because those infrastructures did not exist. Big data could stand alone, big data analytics could be the only focus of analytics, and big data technology architectures could be the only architecture. 

Analytics can be classified into following three types — 

(i) Predictive analytics 
(ii) Descriptive analytics 
(iii) Prescriptive analytics. 

(i) Predictive Analytics — Predictive analysis establishes patterns in previous data and gives a list of outcomes which may occur for a given situation. Predictive analysis studies the present as well as past data and predicts what may happen in future, or gives probabilities of what would happen. It uses big data to forecast other data which we do not have. This analytical method is one of the most commonly used methods for sales lead scoring, social media and consumer relationship management data. 

(a) Predictive modelling 
(b) Decision analysis and optimization 
(c) Transaction profiling. 

For example, predictive analytics is used for optimizing customer relationship management systems. It can help enable an organization to analyze all customer data, thereby exposing patterns that predict customer behaviour. 

(ii) Descriptive Analytics — Descriptive analytics, also known as data mining, operates on what is happening in real time. It is one of the simplest types of analytics as it converts big data into small bytes. The result is monitored through e-mails or dashboards. It is used by the majority of organizations. 

For example, descriptive analytics examines historical electricity usage data to help plan power needs and allow electric companies to set optimal prices. 

(iii) Prescriptive Analytics — Prescriptive data analytics goes one step further and suggests multiple actions with likely outcomes for each decision. This type of analytics is not preferred much by organizations, but it can show impressive results if used correctly. 

For example, prescriptive analytics can benefit healthcare strategic planning by using analytics to leverage operational and usage data combined with data of external factors such as economic data, population demographic trends and population health trends, to more accurately plan for future capital investments such as new facilities and equipment utilization as well as understand the trade-offs between adding additional beds and expanding an existing facility versus building a new one. 


Q.17. Explain core components of analytical data architecture. 
[R.G.P.V., May 2019 (VIII-Sem.)] 


Ans. The big data storage and analytics platform provides resources and functionalities for storage as well as for batch and real-time processing of the big data. It provides main integration interfaces between the site operational platform and the cloud data lab platform and the programming interfaces for the implementation of the data mining processes. The internal structure of the big data storage and analytics platform is given in fig. 1.6. 

Functions Repository 
Functions Export 
Messaging Service 
Distributed Data Processing Framework 
Distributed Database 
Data Replication Service 
Distributed File System 

Fig. 1.6 The Internal Structure of the Big Data Storage and Analytics Platform 

Data are primarily stored in the distributed file system, which is responsible for the distribution and replication of large datasets across multiple servers (data nodes). A unified access to the structured data is provided by the distributed database using the standard SQL interface. The main component responsible for data processing is the distributed data processing framework. Predictive functions are stored in the functions repository, where they are available for production 
processes. The rest of the components (messaging service and data replication service) provide data communication interfaces, and connect the operation platform to the data lab platform. 

The big data storage and analytics platform consists of the following components — 

(i) Distributed File System — Provides a reliable, scalable file system with similar interfaces and semantics to access data as local file systems. 

(ii) Distributed Database — Provides a structured view of the data stored in the data lab platform using the standard SQL language, and supports standard RDBMS programming interfaces such as JDBC for Java. 

(iii) Distributed Data Processing Framework — Allows the execution of applications in multiple nodes in order to retrieve, classify or transform the arriving data. The framework provides data analytics APIs on two main paradigms for processing large datasets — API for parallel computation and API for distributed computation. 

(iv) Functions Repository — Provides storage for predictive functions together with the settings required for the deployment of functions. 

(v) Messaging Service — Implements an interface for real-time communication between the data lab and operation platforms. It provides a publish-subscribe messaging system for asynchronous real-time two-way communication, which allows the decoupling of data providers and consumers. 

(vi) Data Replication Service — Provides an interface for the replication of the historical batch data between the data lab and operation platform. 


Q.18. Explain the application of big data analytics in various fields. 


Ans. Big data analytics applications (BDA Apps) are a new category of software applications that leverage large-scale data, which is typically too large to fit in memory or even on one hard drive. 

(i) In Clustering — Using clustering (K-means algorithm) and a simple dialog, users can automatically find groups based on specific data dimensions. With clustering, it is then possible to identify and address groups by customer type, text documents, products, patient records, click path, behaviour, purchasing patterns, etc. 

(ii) In Data Mining — Datameer's decision trees automatically help users understand what combination of data attributes result in a desired outcome. Decision trees illustrate the strengths of relationships and dependencies within data and are often used to determine what common attributes influence outcomes such as disease risk, fraud risk, purchases and online signups. The structure of the decision tree reflects the structure that is hidden in the data. 

(iii) In Banking — The use of customer data invariably raises privacy issues. By uncovering hidden connections between seemingly unrelated pieces of data, big data analytics could potentially reveal sensitive personal information. Research indicates that 62% of bankers are cautious in their use of big data due to privacy issues. Further, outsourcing of data analysis activities or distribution of customer data across departments for the generation of richer insights also amplifies security risks. For instance, a recent security breach at a leading UK-based bank exposed databases of thousands of customer files. Although this bank launched an urgent investigation, files containing highly sensitive information such as customers' earnings, savings, mortgages and insurance policies were exposed. Such incidents raise concerns about data privacy and discourage customers from sharing personal information in exchange for customized offers. 

(iv) In Marketing — Marketers have begun to use facial recognition software to learn how well their advertising succeeds or fails at stimulating interest in their products. A recent study published in the Harvard Business Review looked at what kinds of advertisements compelled viewers to continue watching and what turned viewers off. Among their tools was "a system that analyses facial expressions to reveal what viewers are feeling." The research was designed to discover what kinds of promotions induced watchers to share the ads with their social network, helping marketers create ads most likely to "go viral" and improve sales. 

(v) In Smart Phones — Perhaps more impressive, people now carry facial recognition technology in their pockets. Users of iPhone and Android 

(vi) In Telecom — Nowadays big data is used in different fields, and it also plays a very good role in telecom. Service providers are trying to compete in the cut-throat world of telecom services. Operators, at a time when more and more subscribers rely on over-the-top (OTT) players as providers of value-added services, are focused on increasing revenue, reducing opex, and improving customer experience as key business objectives. 

Operators believe that big data and advanced analytics will play an important role in helping them meet their business objectives. In the survey, respondents indicate critical use case scenarios in the context of big data and advanced analytics where they are investing now and where they plan to invest in the next three years. 

Operators face an uphill challenge when they need to deliver compelling, revenue generating services without overloading their networks. 
(vii) In Agriculture — A biotechnology firm uses data to optimize crop efficiency. It plants test crops and runs simulations to measure how plants react to various changes in condition. Its data environment constantly adjusts to changes in the attributes of various data it collects, including temperature, water levels, soil composition, growth, output, and gene sequencing of each plant in the test bed. These simulations allow it to discover the optimal environmental conditions for specific gene types. 
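The K-means clustering mentioned under (i) In Clustering can be sketched as follows. This is a plain illustrative version, not the implementation of any particular product: the customer data is hypothetical, and taking the first `k` points as starting centroids is an assumption made so that the result is repeatable.

```python
def kmeans(points, k, iterations=10):
    """Plain K-means on 2-D points; starts from the first k points (an assumption)."""
    centroids = [tuple(p) for p in points[:k]]
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[distances.index(min(distances))].append((x, y))
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for centroid, cluster in zip(centroids, clusters):
            if cluster:
                mean_x = sum(p[0] for p in cluster) / len(cluster)
                mean_y = sum(p[1] for p in cluster) / len(cluster)
                new_centroids.append((mean_x, mean_y))
            else:
                new_centroids.append(centroid)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical customers: (visits per month, average spend).
customers = [(1, 10), (2, 12), (20, 200), (22, 210), (1.5, 11), (21, 205)]
centroids, clusters = kmeans(customers, k=2)
print(centroids)
```

With this data the occasional, low-spending customers and the frequent, high-spending customers end up in separate groups, which is the kind of customer-type grouping the text describes.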


Q.19. Explain advantages of using big data analytics in healthcare sector and banking sector. 


Ans. Advantages of using Big Data Analytics in Healthcare Sector — Advantages of using big data analytics in healthcare sector are as follows — 

(i) For Research — The large amount of data produced gives great opportunity to researchers in fields of health informatics, by using tools and techniques for unlocking the hidden patterns. 

(ii) For Individuals/Patients — In deciding any line of treatment for a patient, historical data about the symptoms, drugs, outcomes and responses of different patients is taken into account. The move is towards formulating personalized treatment for a patient based on the genomic data, locality, area and lifestyle, response to certain medicines, allergy, and family history. When genome data is known completely, some kinds of relations are established between the DNA and the disease. Then specific treatment is formulated for such a small group or individual. The patient gets advantage in various ways such as a correct as well as effective line of treatment, better health related decisions, preventive steps in time, continuous health monitoring of patients by wireless devices, design of a personal line of treatment, and increased life quality and expectancy. 

(iii) For Hospitals — By various techniques and tools of BDA, data in hospitals can be used to answer various queries, like the number of patients not getting cured at an early stage, the number needing longer hospitalization, whether a patient will respond positively to the given treatment, whether a patient will respond positively to surgery, and whether the patient is prone to catch a disease in the near future. The hospital management and administration can take better decisions on matters such as the cost of readmission increasing because of patients getting ill again after treatment, increasing the number of staff on floors for efficient working, plans for frequent post-treatment follow ups, etc. 

(iv) For Insurance Companies — Government spends a large amount on giving medical claims to patients. By using BDA, analysis, prediction and minimization of fraudulent medical claims can be done. 

(v) For Pharmaceuticals — BDA techniques help R&D to produce drugs, instruments, tools etc. in a shorter period of time, which are effective in treating specific diseases. 

(vi) For Government — The demographic data, historical data of disease outbreaks, weather data, and data from social media about diseases like cholera, flu etc. is used by the government. Government analyzes this massive data to predict epidemics, by finding correlations between weather and disease, and accordingly preventive measures are taken. Public health surveillance is improved and the response to disease outbreaks is quick by using BDA. 

(vii) For Pharma Companies — To improve workflow quality and quantity, pharma companies need new tools, like predictive modeling, statistical tools and algorithms. These improve the outcome of experiments and provide better understanding of developing drugs. Such tools also help to successfully navigate the regulatory approval and marketing process. 
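The correlation between weather and disease mentioned under (vi) For Government can be measured with the Pearson correlation coefficient. The sketch below uses that coefficient as one possible measure, and the weekly rainfall and case counts are hypothetical numbers invented for illustration, not real surveillance data.

```python
from math import sqrt


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of the same length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical weekly figures for one district.
weekly_rainfall_mm = [5, 20, 35, 50, 80]
weekly_cholera_cases = [2, 6, 9, 14, 25]
r = pearson_correlation(weekly_rainfall_mm, weekly_cholera_cases)
print(round(r, 2))
```

A value close to +1 indicates that cases rise with rainfall in this sample. Correlation alone does not prove causation, so such a result would only be a signal for further investigation.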

Advantages of using Big Data Analytics in Banking Sector — Advantages of using big data analytics in banking sector are as follows — 

(i) Sentiment Analytics — Banks need to monitor customers' opinions continuously. Banks need to identify which are their key customers and, by using their feedback, fix the flaws in their system. This leads to an increase in their productivity and services. 

(ii) Changes in Service Delivery — Whenever a reputation or account range enters into the system, it checks through all the information and provides the desired information. This allows banks to map work processes and save time and costs. Huge information and its proper knowledge allow an organization to identify and solve issues before they affect its customers. 

(iii) Fraud Detection and Prevention — One of the most important obstacles faced by the banking sector is fraud. Big data helps to ensure that no unauthorized transactions are done and provides security to the entire system. 

(iv) Enhanced Reporting — Access to huge amounts of information also reveals the different needs of different customers. The banks can then address these needs in a meaningful way. The banking industry provides the services required by the customer by using big data. 

(v) Risk Management — Early detection of fraud is a major part of risk management. A large amount of information does the major part of risk management as it helps to identify fraud. Massive information plays a role in bringing the needs of banks together into a centralized platform, by which the possibility of losing information is reduced. 

(vi) Customer Segmentation — By identifying valuable customers, loyalty programs are created. Targeted marketing programs are made and relationships are built with valuable customers. 

(vii) Examine Customer Feedback — Customers' feedback is collected in text form from various social media sites and, after analysis, it is classified into positive and negative. This is used to provide better service to the customer. 
Q.20. Explain open-source technology for big data analytics. 

Ans. Open-source software is computer software that is available in source 
code form under an open-source license that permits users to study, change, 
and improve and at times also to distribute the software. The open-source 
name came out of a 1998 meeting in Palo Alto held in reaction to Netscape's 
announcement of a source code release for Navigator (as Mozilla). 

Although the source code is released, there are still governing bodies and 
agreements in place. The most prominent and popular example is the GNU General 
Public License (GPL), which "allows free distribution under the condition 
that further developments and applications are put under the same license." 
This ensures that the products keep improving over time for the greater 
population of users. 

Some other open-source projects are managed and supported by commercial 
companies, such as Cloudera, that provide extra capabilities, training and 
professional services that support open-source projects such as Hadoop. This is 
similar to what Red Hat has done for the open-source project Linux. 

"One of the key attributes of the open-source analytics stack is that it is 
not constrained by someone else's predetermined ideas or vision," says David 
Champagne, chief technology officer at Revolution Analytics, a provider of 
advanced analytics. "The open-source stack does not put you into a straitjacket. 
You can make it into what you want and what you need. If you come up with 
an idea, you can put it to work immediately. That's the advantage of the 
open-source stack: flexibility, extensibility, and lower cost." 

"One of the great benefits of open-source lies in the flexibility of the 
adoption model. You download and deploy it when you need it," said Yves de 
Montcheuil, vice president of marketing at Talend, a provider of open-source 
data integration solutions. You do not need to prove to a vendor that the 
money is in your budget. With open source, you can try it and adopt it at your 
own pace. 

One disadvantage of open-source is that it has to coexist with proprietary 
solutions for a long time, for many reasons. For example, getting data from 
Hadoop to a database required a Hadoop expert in the middle to do 
the data cleansing and the data type translation. If the data was not 100% 
clean (which is the case in most circumstances) a developer was needed to 
get it to a consistent, proper form. Besides wasting the valuable time of that 
expert, this process meant that business analysts could not directly access 
and analyze data in Hadoop clusters. SQL-H is software that was developed to 
solve this problem. 

Q.21. What are the desired properties of big data system ? 

Ans. The desired properties of big data system are as follows — 

(i) Error Tolerance and Robustness — Because of the challenges of a 
distributed system, it is very difficult to build a system that "does the 
right thing". Systems are required to behave correctly despite machines going 
down randomly, the complex semantics of consistency in distributed databases, 
redundancy, concurrency, and many more. These challenges make it complicated 
even to reason about the functioning of the system. Robustness of a big data 
system is needed to overcome the complexities associated with it. 

(ii) Scalability — It is the ability to maintain performance with 
growing data and load by adding resources to the system. The lambda 
architecture is horizontally scalable across all layers of the system stack, 
i.e. scaling is achieved by including more machines. 

(iii) Generalization — A general system can support a wide range of 
applications. As the lambda architecture is based on functions of all data, it 
generalizes to all applications, whether financial management systems, social 
media analytics or any other. 

(iv) Debuggability — A big data system must provide the information 
required to debug the system when things go wrong. We should be able to 
trace, for each value in the system, exactly what caused it to have that value. 
Debuggability is achieved in the lambda architecture through the functional 
nature of the batch layer and by preferring to use recomputation algorithms 
when possible. 

(v) Ad hoc Queries — The ability to perform ad hoc queries on the 
data is significant. Every large dataset contains unanticipated value in it. 
Having the ability to mine the data arbitrarily provides opportunities for new 
applications and business optimization. 

(vi) Extensibility — An extensible system enables functionality to be 
added cost effectively. Sometimes, a new feature or a change to an already 
existing system feature needs to migrate pre-existing data into a new data 
format. Large scale transfer of data becomes easy in an extensible system. 

(vii) Minimal Maintenance - Maintenance is the work required to 
keep a system running smoothly. This includes anticipating when to add 
machines to scale, keeping processes up and running and debugging anything 
that goes wrong in production. In order to have minimum maintenance, 
components with low implementation complexity should be selected. 

(viii) Low Latency Reads and Updates — A large number of applications 
need low latency reads, within a few milliseconds to a hundred milliseconds, 
while the update latency requirements may vary widely. In some applications 
updates need to propagate immediately, but in other applications an update 
latency of a few hours is allowed. 

Q.1. What is Hadoop ? 


HADOOP 


INTRODUCTION TO HADOOP, CORE HADOOP COMPONENTS, 
HADOOP ECOSYSTEM, HIVE PHYSICAL ARCHITECTURE, 
HADOOP LIMITATIONS, RDBMS VERSUS HADOOP 


Ans. Hadoop was developed in the year 2005 by Doug Cutting and 
Mike Cafarella. It is Apache open source software which allows huge volumes 
of data to be stored and processed in a distributed environment, and it is 
written in Java. Hadoop is also called MR1. Major web companies such as 
Facebook, Yahoo, Google, Twitter and LinkedIn use the Hadoop technology to 
process their huge volumes of data. 

It is mainly designed to scale up from a single machine to thousands of 
machines while tolerating failure. Hadoop consists of two main frameworks, 
the MapReduce layer and the HDFS layer. The MapReduce layer is used for 
processing the big data (where the user application executes) and HDFS is 
used to store the big data (where the user data resides). 
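
The split between a storage layer and a processing layer can be illustrated 
with a small word count. This is only a single-process sketch in plain Python 
and not Hadoop code: the list `blocks` stands in for the pieces of a file 
held by the storage layer, and the function names `map_phase`, `shuffle` and 
`reduce_phase` are illustrative assumptions. 

```python
from collections import defaultdict


def map_phase(block):
    """Map step: emit a (word, 1) pair for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: add up the counts collected for each word."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(blocks):
    """Run map on every block, then shuffle and reduce the combined output."""
    pairs = []
    for block in blocks:  # on a cluster, each block would be mapped on its own server
        pairs.extend(map_phase(block))
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    blocks = ["big data needs Hadoop", "Hadoop stores big data"]
    print(word_count(blocks))
```

In this sketch the map step only ever sees one block at a time, which is what 
lets a real cluster run it on many servers at once. 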


Q.2. Explain main components of Hadoop. 

Ans. Two main components of Hadoop are as follows — 

(i) The Hadoop Distributed File System (HDFS) — HDFS is the 
storage system for a cluster. When data lands in the cluster, HDFS breaks it 
into pieces and distributes those pieces among the different servers 
participating in the cluster. Each server stores just a small fragment of the 
complete data set and each piece of data is replicated on more than one server. 

Fig. 2.1 HDFS & Map Reduce 

(ii) Map Reduce — Because Hadoop stores the entire dataset in 
small pieces across a number of servers, analytical jobs can be distributed, in 
parallel, to each of the servers storing part of the data. Each server evaluates 
the question against its local fragment simultaneously and reports its results 
back for collation into a comprehensive answer. MapReduce is the agent 
that distributes the work and collects the results. Both HDFS and MapReduce 
are designed to continue to work even if there are failures. HDFS continuously 
monitors the data stored on the cluster. If a server becomes unavailable, a 
disk drive fails or data is damaged due to hardware or software problems, 
HDFS automatically restores the data from one of the known good replicas 
stored elsewhere on the cluster. MapReduce monitors the progress of each of 
the servers participating in the job. If one of them is slow in returning 
an answer or fails before completing its work, MapReduce automatically 
starts another instance of the task on another server that has a copy of the 
data. Because of the way that HDFS and MapReduce work, Hadoop provides 
scalable, reliable and fault-tolerant services for data storage and analysis at 
very low cost. 

Q.3. Write short note on Hadoop's parallel world. 

Ans. The Hadoop framework application works in an environment that 
provides distributed storage and computation across clusters of computers. 
Hadoop is designed to scale up from a single server to thousands of machines, 
each offering local computation and storage. 

Hadoop's parallel world has the following two major layers — 

(i) Processing/computation layer is called MapReduce 

(ii) Storage layer is called Hadoop Distributed File System (HDFS) 

Q.4. Explain the ecosystem of Hadoop. 

Ans. Hadoop is an open source framework maintained by the Apache 
foundation for reliable, scalable and distributed computing. According to the 
website hadoop.apache.org, the ecosystem contains many projects which 
function differently from one another. Some of the widely used Hadoop 
components are as follows — 

(i) Pig - It is a platform for manipulating data stored in HDFS. It consists 
of a compiler for MapReduce programs and a high-level language called Pig 
Latin. It provides a way to perform data extractions, transformations and 
loading, and basic analysis without having to write MapReduce programs. 

(ii) Hive - It is a distributed data warehouse with a SQL-like query 
language that presents data in the form of tables. Hive programming is similar 
to database programming. (It was initially developed by Facebook). 

(iii) HBase - It is a non-relational, distributed database that runs on 
top of Hadoop. HBase tables can serve as input and output for MapReduce jobs. 

(iv) Zookeeper - It is an application that coordinates distributed 
processing. 

(v) Mahout - Mahout is data mining software. It offers Java libraries of 
scalable machine learning algorithms which can be used for analyzing the data. 
These machine learning algorithms allow the user to perform tasks such as 
classification, clustering, association rule analysis, and predictive analysis. 

(vi) Cassandra — Cassandra provides a database that is highly scalable 
and highly available without interruption in job performance. 

(vii) Chukwa — Chukwa is a data collection system which is mainly 
used for displaying, monitoring, and analyzing the outcomes of the collected 
data. 

(viii) Spark — Spark is a computing system which is used for fast 
processing of Hadoop data on the Hadoop cluster. Spark does not use the 
MapReduce execution engine to run the job. It uses its own distributed runtime 
to complete the job. 

(ix) Tez — Tez is a data-flow programming framework built on YARN 
to execute an arbitrary DAG of tasks to process data for both batch and 
interactive use-cases. 

(x) Avro — Avro is used for data serialization and provides a 
container file for storing persistent data. Avro was created by Doug Cutting 
to make Hadoop writable in many programming languages such as 
C++, C#, Java, JavaScript, Python and Ruby. 

(xi) Ambari — It is a web interface for managing, configuring and 
testing Hadoop services and components. 

(xii) Flume — It is software that collects, aggregates and moves 
large amounts of streaming data into HDFS. 

(xiii) Sqoop — It is a connection and transfer mechanism that moves 
data between Hadoop and relational databases. 

(xiv) Oozie - It is a Hadoop job scheduler. 

The Hadoop ecosystem is shown in fig. 2.2. 

Fig. 2.2 The Hadoop Ecosystem 

Q.5. What are the system requirements for installing Hadoop ? 

Ans. Various applications can be used for the analysis of big data, and the 
Hadoop software from Apache is one of them. The requirements for installing 
it are as follows — 

(i) Hardware Requirements — 

(a) Operating System — A Hadoop project can be run on 
the Linux or Windows operating system. The Windows 10 version of the operating 
system has been found to be most efficient. 

(b) RAM Size — For using Hadoop the recommended RAM size 
is 4 GB or higher. 

(c) Processor — Two or more core processors are needed for 
installing Hadoop. 

(ii) Software Requirements — 

(a) Java JDK ... or higher 

(b) Hadoop 

(c) Eclipse or IntelliJ community version 

(d) Virtual machine and Cloudera (optional). 

Hadoop is owned by the Apache Foundation, and is available for free 
download. Java, Eclipse, and IntelliJ are also available to download for 
free. Students and researchers can use this software for free. The software 
is also available from commercial vendors which provide support when needed. 
Cloudera and Hortonworks are companies which provide Hadoop support, 
but they charge for the service. Their free versions can be used, but software 
support is not available when needed. 

Once the requirements are met, the Hadoop software can be installed 
free of cost to get started with a simple project. Later the software and 
hardware can be upgraded to work on more complex projects with bigger 
volumes and variety of big data. 

Q.6. What is Sqoop ? Also write its advantages. 

Ans. Sqoop is mainly used to transfer huge amounts of data between Hadoop 
and relational databases. Sqoop stands for "SQL to Hadoop and Hadoop to SQL". 
It imports data from relational databases such as MySQL, Oracle and PostgreSQL 
into Hadoop (HDFS, Hive, HBase) and exports data from HDFS to relational 
databases. Data in a non-Hadoop data store can also be extracted and 
transformed into the Hadoop data store. The process of Extraction, 
Transformation and Loading (ETL) can be performed by using Sqoop. It is the 
open source framework of Cloudera Inc. The data can be imported and exported 
in a parallel manner. 

Fig. 2.3 SQOOP Transformation 

Advantages of Sqoop are as follows — 

(i) It offers the migration of heterogeneous data. 

(ii) It offers easy integration with Hive, HBase and Oozie. 

(iii) We can import the whole database or a single table into HDFS. 
Q.7. What is zookeeper ? Also write its advantages and disadvantages. 

Ans. In a traditional distributed environment, coordinating and managing a 
task is quite complex and complicated. Zookeeper overcomes this problem 
with the help of a simple architecture and its API. The nodes in a cluster, 
in order to maintain shared data and coordinate among themselves, use 
zookeeper as a service to provide robust synchronization. It provides services 
such as naming service (identify the name of the node in the cluster), 
configuration management (up to date information is maintained), cluster 
management (status of a node leaving or joining the cluster), leader election 
(for coordination among themselves the nodes elect a single node as leader in 
the cluster), locking and synchronizing service (when any data is modified in 
the cluster it locks that particular data to provide consistency) etc. The 
inconsistency of data, race condition and deadlock problems of the traditional 
distributed environment are solved easily with the help of zookeeper 
mechanisms such as atomicity, the serialization property and synchronization 
respectively. 

Advantages of Zookeeper — 

(i) It provides reliability and availability of data. 

(ii) It offers high synchronization and serialization. 

(iii) The atomicity eliminates the inconsistency of data among clusters. 

(iv) It is fast and simple. 

Disadvantages of Zookeeper — 

(i) The large number of stacks needs to be maintained. 
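
The leader election service mentioned in the answer can be pictured with a 
small simulation. This is a plain Python sketch and does not use the Zookeeper 
API. It assumes one commonly used recipe, which the text above does not 
describe: every node receives an increasing sequence number when it joins, and 
the live node with the smallest number is the leader. The class name 
`Cluster` and its methods are hypothetical. 

```python
import itertools


class Cluster:
    """Toy membership list with lowest-sequence-number leader election."""

    def __init__(self):
        self._next_seq = itertools.count(1)  # assumed: numbers only ever grow
        self.members = {}                    # node name -> sequence number

    def join(self, name):
        """Register a node and give it the next sequence number."""
        self.members[name] = next(self._next_seq)

    def leave(self, name):
        """Remove a node, for example after it crashes or shuts down."""
        self.members.pop(name, None)

    def leader(self):
        """Return the live node with the smallest sequence number, if any."""
        if not self.members:
            return None
        return min(self.members, key=self.members.get)


if __name__ == "__main__":
    cluster = Cluster()
    for node in ("node-a", "node-b", "node-c"):
        cluster.join(node)
    print(cluster.leader())
    cluster.leave("node-a")   # the leader disappears
    print(cluster.leader())   # the next-oldest node takes over
```

Because the rule depends only on the shared membership list, every node that 
reads the same list reaches the same answer without further negotiation. 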


Q.8. What is Mahout ? Give its advantages and disadvantages. 

Ans. Mahout is an Apache open source software framework that 
provides a data mining library. The processing task can be split into multiple 
segments and each segment can be computed on a different machine in order 
to speed up the computation process. The primary goals of Mahout are 
data clustering, classification, regression testing, statistical modeling and 
collaborative filtering. It provides scalable data mining and machine learning 
(it makes decisions based on the current and previous history of data) 
approaches for the data. 

Advantages of Mahout — 

(i) It supports complementary and distributed naive Bayes 
classification. 

(ii) It mines huge volumes of data. 

(iii) Companies such as Adobe, Twitter, Foursquare 
and LinkedIn internally use Mahout for data mining. 

(iv) Yahoo uses it for pattern mining. 

Disadvantages of Mahout — 

(i) It does not support the Scala version in development. 

(ii) It has no decision tree algorithm. 

Q.9. What is Oozie ? Also write its advantages and disadvantages. 

Ans. Oozie was initially developed at Yahoo for the complex workflows of 
their search engine. Later it became an open source Apache Incubator project. 
It is a workflow scheduler for managing Hadoop jobs. There are two major types 
of oozie jobs, i.e. oozie workflow and oozie coordination. An 
oozie workflow follows a Directed Acyclic Graph (DAG) for parallel and 
sequential execution of jobs in Hadoop. It contains the control flow node. 
The control flow node controls the beginning and end of the workflow 
execution. In oozie coordination, workflow jobs are triggered by time. 

Advantages of Oozie — 

(i) It allows a workflow execution to be restarted from the point of 
failure. 

(ii) It provides a web service API (i.e. we can control the jobs from 
anywhere). 

Disadvantages of Oozie — 

(i) It is not a resource scheduler. 

(ii) It is not suitable for off grid scheduling. 

Q.10. Give some applications of Hadoop. 

Ans. Now-a-days, with the rapid growth of data volume, the storage 
and processing of Big Data have become the most pressing needs of 
enterprises. Hadoop, as an open source distributed computing platform, has 
become a brilliant choice for business. Users can develop their own 
distributed applications on Hadoop and process Big Data even if they do 
not know the bottom-level details of the system. Due to the high performance 
of Hadoop, it has been widely used in many companies. Some applications of 
Hadoop are given below — 

(i) Hadoop in Yahoo! — Yahoo! is the leader in Hadoop technology 
research and applications. It applies Hadoop to various products, which include 
data analysis, content optimization, the anti-spam e-mail system, and 
advertising optimization. Hadoop has also been fully used in user interest 
prediction, search ranking, and advertising location. 

In the Yahoo! home page personalization, the real-time service system 
reads the data from the database to the interest mapping through Apache. 
Every 5 minutes, the system rearranges the contents based on the Hadoop 
cluster and updates the contents every 7 minutes. 

For spam e-mails, Yahoo! uses the Hadoop cluster to score the e-mails. 
Yahoo! improves the anti-spam e-mail model in the Hadoop clusters, and the 
clusters push billions of e-mail deliveries every day. At present, the largest 
application of Hadoop is the Webmap of Yahoo!. It has been run on more than 
10000 Linux cluster machines. 

(ii) Hadoop in Facebook — Facebook is the largest social network 
in the world. From 2004 to 2009, Facebook has over 800 million users. 
The data created every day is huge. This means that Facebook faces the 
problem of big data processing, which covers content maintenance, photo 
sharing, comments, and users' access histories. These data are not easy to 
process, so Facebook has adopted Hadoop and Hbase to handle them. 

Q.11. Why Facebook has chosen Hadoop ? 

Ans. As Facebook developed, it discovered that MySQL cannot meet all 
its requirements. After long-time research and experiments, Facebook finally 
chose Hadoop and Hbase as the data processing system. The reasons 
Facebook chose Hadoop and Hbase have two aspects. On the one hand, 
Hbase meets the requirements of Facebook. Hbase can support rapid access 
to the data. Although Hbase does not support the traditional outer form 
operations, the Hbase column oriented storage model brings highly flexible 
search in the inner form. Hbase is also a good choice for intensive data. It is 
able to maintain huge data, support complex indexes with flexible scalability, 
and guarantee the speed of data access. On the other hand, Facebook has the 
ability to solve the Hadoop problems in real use. For now, Hbase has already 
been able to provide a high consistency and high throughput key-value storage, 
but the Namenode as the only manager node in HDFS may become the bottleneck 
of the system. Facebook therefore designed a high availability Namenode, called 
AvatarNode, to solve this problem. In the aspect of fault tolerance, HDFS can 
tolerate and isolate faults in the subsystem of the disk. The failures of 
whole clusters of Hbase and HDFS are part of the fault tolerance system. 

Overall, with the improvements made by Facebook, Hadoop can 
meet most of Facebook's requirements and can provide a stable, efficient 
service for Facebook users. 

Q.12. What are the advantages of Hadoop ? Explain Hadoop architecture 
and its components with proper diagram. [R.G.P.V., May 20...] 

Ans. Advantages of Hadoop — 

(i) The scalability and elasticity of free open source Hadoop running 
on standard hardware allow organizations to hold onto more data and take 
advantage of all their data to increase operational efficiency and gain a 
competitive edge. Hadoop supports complex analysis across large collections of 
data at one-tenth the cost of traditional solutions. 

(ii) Hadoop handles a variety of workloads, including search, log 
processing, recommendation systems, data warehousing and video/image analysis. 

(iii) Apache Hadoop is an open-source project by the Apache 
Software Foundation. The software was originally developed by the world's 
largest Internet companies to capture and analyze the data that they generate. 
Unlike traditional, structured platforms Hadoop is able to store any kind of 
data in its native format and to perform a wide variety of analyses and 
transformations on that data. Hadoop stores terabytes and even petabytes of 
data inexpensively. It is robust and reliable and handles hardware and system 
failures automatically without losing data or interrupting data analyses. 

(iv) Hadoop runs on clusters of commodity servers and each of those 
servers has local CPUs and disk storage that can be leveraged by the system. 

Hadoop Architecture — Hadoop is an open-source framework that allows 
users to store and process big data in a distributed environment across clusters 
of computers using simple programming models. It is designed to scale up 
from single servers to thousands of machines with a high degree of fault 
tolerance. Data in a Hadoop cluster is broken down into smaller pieces and 
distributed throughout the cluster. Likewise, the Map and Reduce functions are 
executed on smaller subsets of larger data sets, and this provides the 
scalability needed for big data processing. 

Fig. 2.4 Hadoop Architecture 

The Hadoop framework includes four modules — 

(i) Hadoop Common — It contains Java libraries and utilities 
that are required by other Hadoop modules. The Java libraries provide file 
system and OS level abstraction. It contains the necessary Java files and 
scripts that are required to start Hadoop. 

(ii) Hadoop Yarn - YARN is a cluster management technology. It is 
one of the key features in the second generation of Hadoop, designed from the 
experience gained from the first generation of Hadoop. YARN provides resource 
management and a central platform to deliver consistent operations, security 
and data governance tools across Hadoop clusters. 

(iii) HDFS (Hadoop Distributed File System) - It is a distributed file 
system that provides high throughput access to application data. 

(iv) Hadoop MapReduce — This is a programming model for large scale 
data processing. 

Components of Hadoop - Refer to Q.4. 
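
How a file is broken into pieces and spread over a cluster can be sketched in 
a few lines of Python. This is an illustration only and not the HDFS 
implementation. The 128 MB block size is the default quoted elsewhere in this 
unit and a replication factor of 3 is the usual HDFS default, while the 
round-robin placement and the function names `split_into_blocks` and 
`place_replicas` are simplifying assumptions (real HDFS placement also takes 
racks into account). 

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the size of each block; only the last block may be smaller."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


def place_replicas(num_blocks, servers, replication=3):
    """Assign every block to `replication` distinct servers, round-robin."""
    if replication > len(servers):
        raise ValueError("not enough servers for the requested replication")
    placement = {}
    for block in range(num_blocks):
        # Consecutive offsets modulo the server count never repeat a server
        # as long as replication <= len(servers).
        placement[block] = [
            servers[(block + offset) % len(servers)]
            for offset in range(replication)
        ]
    return placement


if __name__ == "__main__":
    sizes = split_into_blocks(300)  # a hypothetical 300 MB file
    print(sizes)
    print(place_replicas(len(sizes), ["s1", "s2", "s3", "s4"]))
```

Since every block lives on several servers, losing one server still leaves at 
least one good copy of each block elsewhere in the cluster. 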
Q.13. Explain the Hive physical architecture. 

Ans. Fig. 2.5 shows the major components of Hive and its interaction 
with Hadoop. The main components of Hive include the client interfaces such as 
JDBC and ODBC, the Thrift server and the driver. 

The Hive Thrift Server exposes a very simple client API to execute HiveQL 
statements. Thrift is a framework for cross-language services, where a server 
written in one language (like Java) can also support clients in other languages. 
The Thrift Hive clients generated in different languages are used to build 
common drivers like JDBC (Java), ODBC (C++), and scripting drivers written in 
PHP, Perl, Python etc. 

The Driver manages the life cycle of a HiveQL statement during compilation, 
optimization and execution. On receiving the HiveQL statement from the Thrift 
server or other interfaces, it creates a session handle which is later used to 
keep track of statistics like execution time, number of output rows, etc. 

Fig. 2.5 Hive Physical Architecture 

There are 2 types of tables in Hive viz. managed table and external table. 

(i) Managed Table — A managed table is like any database table where 
data can be stored and queried. On dropping these tables the 
data is also lost forever. 

Creating an Internal Table - 

    CREATE TABLE STUDENTS (roll_number INT, 
    name STRING, age INT, address STRING) 
    ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY ','; 

(ii) External Table — 

    CREATE EXTERNAL TABLE STUDENTS (roll_number 
    INT, name STRING, age INT, address STRING) 
    ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY ',' 
    LOCATION '...'; 

ROW FORMAT should specify the delimiters used to terminate the fields and 
lines. In the above example the fields are terminated with a comma (","). 

Load Command: 

    hive> LOAD DATA LOCAL INPATH '/home/hadoop/file.txt' INTO TABLE students; 

Select Command: 

    hive> SELECT * FROM students; 

Q.14. Give the limitations of Hadoop. 

Ans. The limitations of Hadoop are as follows — 

(i) Security Concerns — Hadoop is missing encryption at the storage 
and network levels, which is a major limitation from the point of view of 
government agencies and other organizations that prefer to keep their data 
under wraps. 

(ii) Vulnerable by Nature — Speaking of security, the very makeup 
of Hadoop makes running it a risky proposition. The framework is written 
almost entirely in Java. Java has been heavily exploited by cyber criminals 
and, as a result, implicated in numerous security breaches. For this reason, 
several experts have suggested dumping it in favor of safer, more efficient 
alternatives. 

(iii) Not Fit for Small Data — Due to its high capacity design, the 
Hadoop distributed file system lacks the ability to efficiently support the random 
reading of small files. As a result, it is not recommended for organizations 
with small quantities of data. 


0.15. Differentiate Hadoop vs distributed data base. 
ІЕ-СР.М, May 2019 (ҮШ-бет.)] 


Ans. Differences between Hadoop and distributed data base are as follows — 


тт [RDBMS | шер | 


Туре of data Structured data with Unstructured and 
known schemes structured 

Records, long fields, Files 
objects, XML 


Data groups 
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Data modification | Updates allowed 


Data loss is not 


requirement 
Evolution 

(for 2016) 

Data processing 


30+ years of innovation 


5-8 years old technolog, 


Streaming access 
. files 
Small numberof сот 


Batch processing 


to full 


| 
as it 
requi 


‚ HADOOP DISTRIBUTED FILE SYSTEM, PROCESSING DATA 
, WITH HADOOP, MANAGING RESOURCES AND APPLICATIO! 
WITH HADOOP YARN, MAPREDUCE PROGRAMMING 


0.16. Discuss in detail about Hadoop Distributed File System (HDI 
[R.GB., May 2019 (VlI-Sen) 


Ans. HDFS also known as Hadoop Distributed File System is one of 
Hadoop components which handles the storage of big data. When users 1* 
to add more storage in the System, 
then they can easily increase the 


Storage capacity by adding ced is 

Servers. HDFS consist of number wes 

of clusters depending upon the - 

user configurations. The cluster iens I" 
N 


Consists of Master and Slave 
Nodes. The data in the Hadoop 
Cluster are broken into many small 
blocks which are 128 MB sizes 
by default. These blocks are 


Fig. 2.6 Hadoop Cluster Node 


the different slaves" nodes in the Hadoop clusters, 
n 


ete or re 


the form 0! 


| HDFS isa di кый 
. There аге many comm ч Rais 
sedie the differences between them are also obvious. HDFS is a high fault 


plerant syste 
ighput access to 


never 


m 


ata from the disk can be made significantly larger than 
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These blocks 


Programs SQL & XQuery red i calable and can be increased when needed, 
Access Quick response, highly 5 fil | 
dom access (DFS, users can create new hile, append content to the end of file, 
ran In H sci the file, and modify file attributes. In Comparison to traditional 


handling data, Hadoop's storage can be scalable at а Very low cost 


Data loss 
аре acceptable sometimes thod of uses commodity hardware. 
Security and. - | Yes Not yet cause viridi n of clusters. And cluster have Master node and Slave 
auditing - Hadoop 1$ the fig. 2.6. Master node is also known as name node which 
Encryption Yes Not yet ойс as ра the slave nodes. Beside assigning jobs to the slave nodes, 
Compression Sophisticated data Simple file compre, 2581815 p manages the file system namespace. All the details are store in the 
vnethiod compression pression [master по! iet image and edit log. Cluster have only one master node, where 
Hardware Enterprise hardware Commodity іні (опт of aa e multiple slave nodes. The function of slave nodes is to store data in 
‘areas it may f blocks and performed a job assigned by the master node. 


ibuted system which is suitable for running on the commodity 
Е je haracteristics in the existing distributed 


i fide high 
d relaxed the parts of the POSIX constraints to provid 
а the data so that 3t can be suitable to applymg on the big data. 


Acceptance Large DBA and application ; : "m 
development community, | using itin production mal 0.17. What are the features of HDFS ? List out the characteristics of 
widely used. startups. IHDFS. 


Ans. The Features of HDFS — HDFS is nota general-purpose file system, 


specific s of applications, it does not need all the 

а на dix baked file system. For example, security has 
been supported for HDFS systems. 
Characteristics — The characteristics of HDFS are as follows — 

(i) HDFS fault tolerance 

(ii) Block replication 

(iii) Replica placement 

(iv) Heartbeat and block report messages 

(v) HDFS high throughput access to large dataset. 


0.18. Why is a block in HDFS so large ? List out the areas where 
FS cannot be used. 4 inimize 
Ans. HDFS blocks are large compared to disk blocks, in order to min 


i the time to transfer the 
rnc does bo mate gait age the time to seek to the 


iple blocks 
rt of the block. Thus the time to transfer a large file made of multiple 


erates at the disk transfer rate. 
HDFS cannot be used in following areas — 
(i) Low-latency data access

(ii) Lots of small files
(iii) Multiple writers, arbitrary file modifications.


Q.19. Explain the architecture of Hadoop distributed file system (HDFS).

Ans. HDFS has a master/slave architecture. The Namenode is the master
node, while the Datanode is the slave node. Documents are stored as data
blocks in the Datanode. The default size of a data block is 64 MB and it can
be changed. If a file is smaller than the block size, HDFS does not take up
the whole block's storage space. The Namenode and the Datanode normally run as
Java programs in the Linux operating system.

Fig. 2.7 HDFS Architecture

The Namenode, which is the manager of the HDFS, is responsible for the
management of the namespace in the file system. It puts the metadata of all
the folders and files into a file system tree which maintains all the metadata
of the files and directories. At the same time, the Namenode also saves the
corresponding relations between each file and the location of its data blocks.
The Datanode is the place where the real data in the system is stored. However,
the block location information is not stored on the hard drive; it is collected
when the system starts, in order to find the data server holding the required
documents.

The Secondary Namenode is a backup node for the Namenode. If there
is only one Namenode in the Hadoop cluster environment, the Namenode will
obviously become the weakest point of the process in the HDFS. Once a
failure of the Namenode occurs, it will affect the whole operation of the system.
This is the reason why Hadoop designed the Secondary Namenode as an
alternative backup. The Secondary Namenode usually runs on a separate
computer and communicates with the Namenode at certain time intervals to keep
a copy of the file system metadata, so that it can recover immediately in case
some error happens.

The Datanode is the place where the real data is saved, and it supports the
fault-tolerance mechanism. The files in HDFS are usually divided into multiple
data blocks, which are stored redundantly in the Datanodes. The Datanode reports
its data storage lists to the Namenode regularly so that the user can obtain the
data by direct access to the Datanode.

The client is the HDFS user. It can read and write the data through the API
provided by HDFS. During the read and write process, the client first needs to
obtain the metadata information from the Namenode, and then the client can
perform the corresponding read and write operations.


Q.20. Write short note on the followings -
(i) Authority management of HDFS.
(ii) Limitations of HDFS.

Ans. (i) Authority Management of HDFS - HDFS shares a similar
authority system to POSIX. Each file or directory has an owner and a group.
The authority permissions for the files or the directories are different for the
owner, users in the same group, and other users. On the one hand, for the
files, users require the r authority to read and the w authority to write.
On the other hand, for the directories, users need the r authority to list the
directory content and the w authority to create or delete. Unlike the POSIX
system, there are no sticky, setuid or setgid bits because there is no
concept of executable files in HDFS.

(ii) Limitations of HDFS - HDFS, as the open source implementation of
GFS (Google File System), is an excellent distributed file system and has
many advantages. HDFS was designed to run on cheap commodity hardware,
not on expensive machines. This means that the probability of node failure is
fairly high. Giving full consideration to the design of HDFS, we may find
that HDFS has not only advantages but also limits for dealing with some specific
problems. The limitations of HDFS are as follows -

(a) High Access Latency - HDFS does not fit requests which should be
served in a short time. HDFS was designed for big data storage and it is mainly
used for its high throughput abilities. This may come at the cost of high
latency. Because HDFS has only one single Master system, all the file requests
need to be processed by the Master. When there is a huge number of requests,
there is an inevitable delay. Currently, there are some additional projects to
address this limitation, such as HBase, which manages the data through an upper
data management layer.

(b) Poor Small Files Performance - HDFS needs to use the Namenode to
manage the metadata of the file system, to respond to the client and return
the block locations, so the limit on the number of files is determined by the
Namenode. In general, each file, folder, and block needs to occupy about 150
bytes of space. In other words, if there are one million files and each file
occupies one block, it will take 300 MB of space. Based on the current
technology, it is possible to manage millions of files. However, when the files
extend to billions, the work pressure on the Namenode is heavier and the time
of retrieving data is unacceptable.

(c) Unsupported Multiple Users Write Permissions - In HDFS, one file has
just one writer, because multiple users' write permissions are not supported
yet. Data can only be added at the end of the file, not at an arbitrary
position of the file, by using the Append method.
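
The small-files estimate in Q.20 can be reproduced with a short calculation. This is a rough sketch under the text's own assumption of about 150 bytes per file or block object; directories are ignored, and the function name is hypothetical.

```python
BYTES_PER_OBJECT = 150  # rough per-object figure quoted in the text


def namenode_memory_bytes(num_files, blocks_per_file=1):
    """Estimate Namenode memory: one object per file plus one per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


if __name__ == "__main__":
    for files in (1_000_000, 1_000_000_000):
        megabytes = namenode_memory_bytes(files) / 1_000_000
        print(f"{files:>13,} files -> about {megabytes:,.0f} MB of Namenode memory")
```

One million single-block files give two million objects, which is the 300 MB quoted above; the same arithmetic for a billion files gives hundreds of gigabytes held by a single Namenode.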


Q.21. Describe in detail about dataflow of file read in HDFS.


Ans. To get an idea of how data flows between the client interacting with
HDFS, the Namenode and the Datanode, consider the fig. 2.8, which shows
the main sequence of events when reading a file.


[Figure labels: DistributedFileSystem, FSDataInputStream; step 2: get block locations]

Fig. 2.8 Client Reading Data from HDFS

The client opens the file it wishes to read by calling open( ) on the
FileSystem object, which for HDFS is an instance of DistributedFileSystem
(step 1). DistributedFileSystem calls the Namenode, using RPC, to determine
the locations of the blocks for the first few blocks in the file (step 2). For
each block, the Namenode returns the addresses of the Datanodes that have
a copy of that block. Furthermore, the Datanodes are sorted according to
their proximity to the client. If the client is itself a Datanode (in the case of
a MapReduce task, for instance), then it will read from the local Datanode.

The DistributedFileSystem returns an FSDataInputStream to the client for
it to read data from. FSDataInputStream in turn wraps a DFSInputStream,
which manages the Datanode and Namenode I/O. The client then calls read( )
on the stream (step 3). DFSInputStream, which has stored the Datanode
addresses for the first few blocks in the file, then connects to the first (closest)
Datanode for the first block in the file. Data is streamed from the Datanode
back to the client, which calls read( ) repeatedly on the stream (step 4). When
the end of the block is reached, DFSInputStream will close the connection to
the Datanode, then find the best Datanode for the next block (step 5). This
happens transparently to the client, which from its point of view is just reading
a continuous stream. Blocks are read in order with the DFSInputStream opening
new connections to Datanodes as the client reads through the stream. It will
also call the Namenode to retrieve the Datanode locations for the next batch of
blocks as needed. When the client has finished reading, it calls close( ) on the
FSDataInputStream (step 6).

One important aspect of this design is that the client contacts Datanodes 
directly to retrieve data, and is guided by the Namenode to the best Datanode for 
each block. This design allows HDFS to scale to a large number of concurrent
clients, since the data traffic is spread across all the Datanodes in the cluster. The 
Namenode meanwhile merely has to service block location requests (which it 
stores in memory, making them very efficient), and does not, for example, serve 
data, which would quickly become a bottleneck as the number of clients grew. 
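
The division of labour described above (metadata from the Namenode, data from the Datanodes) can be sketched with a toy in-memory model. Everything here is hypothetical: the classes are not the Hadoop API, and "proximity" is reduced to a same-rack check purely for illustration.

```python
class Datanode:
    def __init__(self, name, rack):
        self.name, self.rack, self.blocks = name, rack, {}


class Namenode:
    """Holds only metadata: file -> block ids, block id -> Datanodes."""

    def __init__(self):
        self.files, self.locations = {}, {}

    def add_block(self, path, block_id, data, datanodes):
        self.files.setdefault(path, []).append(block_id)
        self.locations[block_id] = list(datanodes)
        for datanode in datanodes:
            datanode.blocks[block_id] = data

    def get_block_locations(self, path, client_rack):
        # Replicas on the client's rack sort first (False < True).
        return [
            (block, sorted(self.locations[block], key=lambda dn: dn.rack != client_rack))
            for block in self.files[path]
        ]


def read_file(namenode, path, client_rack):
    """Ask the Namenode where blocks live, then read each from the closest Datanode."""
    chunks, used = [], []
    for block_id, replicas in namenode.get_block_locations(path, client_rack):
        closest = replicas[0]
        chunks.append(closest.blocks[block_id])  # the data itself comes from a Datanode
        used.append(closest.name)
    return b"".join(chunks), used


if __name__ == "__main__":
    d1, d2, d3 = Datanode("dn1", "rack1"), Datanode("dn2", "rack2"), Datanode("dn3", "rack2")
    nn = Namenode()
    nn.add_block("/logs/a.txt", "blk_1", b"hello ", [d1, d2])
    nn.add_block("/logs/a.txt", "blk_2", b"world", [d3, d1])
    print(read_file(nn, "/logs/a.txt", client_rack="rack2"))
```

Note that the Namenode object never touches the block contents during a read; it only answers the location query, which is why it does not become a data bottleneck.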


Q.22. Describe in detail about dataflow of file write in HDFS. 


Ans. The case we are going to consider is the case of creating a new file, 
writing data to it, then closing the file. 


[Figure labels: Namenode; pipeline of Datanodes]

Fig. 2.9 Client Writing Data to HDFS


The client creates the file by calling create( ) on DistributedFileSystem 
(step 1). DistributedFileSystem makes an RPC call to the Namenode to create 
a new file in the filesystem’s namespace, with no blocks associated with it 
(step 2). The Namenode performs various checks to make sure the file does 
not already exist, and that the client has the right permissions to create the file, 
If these checks pass, the Namenode makes a record of the new file; otherwise, 
file creation fails and the client is thrown an IOException. The
DistributedFileSystem returns an FSDataOutputStream for the client to start
writing data to. Just as in the read case, FSDataOutputStream wraps a
DFSOutputStream, which handles communication with the Datanodes and Namenode.

As the client writes data (step 3), DFSOutputStream splits it into packets,
which it writes to an internal queue, called the data queue. The data queue is
consumed by the DataStreamer, whose responsibility it is to ask the Namenode
to allocate new blocks by picking a list of suitable Datanodes to store the
replicas. The list of Datanodes forms a pipeline; we will assume the replication
level is 3, so there are three nodes in the pipeline. The DataStreamer streams
the packets to the first Datanode in the pipeline, which stores the packet and
forwards it to the second Datanode in the pipeline. Similarly, the second
Datanode stores the packet and forwards it to the third (and last) Datanode in
the pipeline (step 4). DFSOutputStream also maintains an internal queue of
packets that are waiting to be acknowledged by Datanodes, called the ack queue.
A packet is removed from the ack queue only when it has been acknowledged
by all the Datanodes in the pipeline (step 5).

If a Datanode fails while data is being written to it, then the following
actions are taken, which are transparent to the client writing the data. First
the pipeline is closed, and any packets in the ack queue are added to the front
of the data queue so that Datanodes that are downstream from the failed
node will not miss any packets. The current block on the good Datanodes
is given a new identity, which is communicated to the Namenode, so
that the partial block on the failed Datanode will be deleted if the failed
Datanode recovers later on. The failed Datanode is removed from the pipeline
and the remainder of the block's data is written to the two good Datanodes
in the pipeline. The Namenode notices that the block is under-replicated, and
it arranges for a further replica to be created on another node. Subsequent
blocks are then treated as normal.

When the client has finished writing data it calls close( ) on the stream
(step 6). This action flushes all the remaining packets to the Datanode pipeline
and waits for acknowledgements before contacting the Namenode to signal
that the file is complete (step 7). The Namenode already knows which
blocks the file is made up of (via DataStreamer asking for block allocations),
so it only has to wait for blocks to be minimally replicated before returning
successfully.

InputFormat is also responsible for creating the input splits and dividing
them into records. The data is divided into a number of splits (typically
64/128 MB) in HDFS. An input split is a chunk of the input that is processed
by a single map.

The InputFormat class calls the getSplits( ) function and computes splits for
each file and then sends them to the jobtracker, which uses their storage
locations to schedule map tasks to process them on the tasktrackers. On a
tasktracker, the map task passes the split to the createRecordReader( ) method
on InputFormat to obtain a RecordReader for that split. The RecordReader
loads data from its source and converts it into key-value pairs suitable for
reading by the mapper. The default InputFormat is TextInputFormat, which treats
each line of input as a new value, and the associated key is the byte offset.

A RecordReader is little more than an iterator over records, and the map
task uses one to generate record key-value pairs, which it passes to the map
function. We can see this by looking at the Mapper's run( ) method -

public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue( )) {
        map(context.getCurrentKey( ), context.getCurrentValue( ), context);
    }
    cleanup(context);
}
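
The record-reading loop can be imitated in a few lines. This is a simplified sketch, not Hadoop code: it treats the whole stream as a single split (real splits need extra handling at line boundaries), and `line_records` and `run_mapper` are names invented for this example.

```python
import io


def line_records(stream):
    """Yield (byte_offset, line) pairs from a binary stream, one per line."""
    offset = 0
    for raw in stream:
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)  # the next key is the offset of the next line


def run_mapper(map_fn, records):
    """Call map_fn once for each (key, value) record and collect its output."""
    output = []
    for key, value in records:
        output.extend(map_fn(key, value))
    return output


if __name__ == "__main__":
    data = io.BytesIO(b"one fish\ntwo fish\n")
    for key, value in line_records(data):
        print(key, repr(value))
```

The key of each record is the byte position where the line starts, and the value is the line text, which matches the description of the default text input format above.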


Q.23. What is the Google file system ? Explain architecture of GFS.


Ans. The Google File System (GFS) is a scalable distributed file system for
large distributed data-intensive applications. It provides fault tolerance while
running on inexpensive commodity hardware, and it delivers high aggregate
performance to a large number of clients. GFS provides a familiar file system
interface, though it does not implement a standard API such as POSIX. Files are
organized hierarchically in directories and identified by path names. GFS supports
the usual operations such as create, delete, open, close, read, and write files.
Moreover, GFS has snapshot and record append operations. Snapshot
creates a copy of a file or a directory tree at low cost. Record append allows
multiple clients to append data to the same file concurrently while guaranteeing
the atomicity of each individual client's append.

Architecture - A GFS cluster consists of a single master and multiple
chunk servers and is accessed by multiple clients, as shown in the fig. 2.11.


[Figure labels: Application; (File Name, Chunk Index); (Chunk Handle, Chunk Locations);
Instructions to Chunkserver; Chunkserver State; (Chunk Handle, Byte Range); Chunk Data.
Legend: data messages, control messages]

Fig. 2.11 GFS Architecture

Each of these is typically a commodity Linux machine running a user- 
level server process. Files are divided into fixed-size chunks. Each chunk is 
identified by a fixed and globally unique 64-bit chunk handle assigned by the 
master at the time of chunk creation. Chunk servers store chunks on local 
disks as Linux files. For reliability, each chunk is replicated on multiple chunk 
servers. By default, there are three replicas, and this value can be changed by
the user. The master maintains all file system metadata. This includes the
namespace, access control information, the mapping from files to chunks, 
and the current locations of chunks. It also controls system-wide activities 
such as chunk lease management, garbage collection of orphaned chunks, 
and chunk migration between chunk servers. The master periodically 
communicates with each chunk server in Heart Beat messages to give 

instructions and collect its state. 
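
Because chunks have a fixed size, a client can work out which chunk a byte offset falls in before it asks the master for the chunk handle and locations. The sketch below shows that translation only; the 64 MB chunk size is an assumption (the text says only "fixed-size"), and `chunk_requests` is a hypothetical helper.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # assumed fixed chunk size of 64 MB


def chunk_requests(offset, length, chunk_size=CHUNK_SIZE):
    """Split a byte range of a file into (chunk_index, start_in_chunk, n_bytes) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        index = offset // chunk_size   # which chunk holds this byte
        start = offset % chunk_size    # position inside that chunk
        n_bytes = min(chunk_size - start, end - offset)
        pieces.append((index, start, n_bytes))
        offset += n_bytes
    return pieces


if __name__ == "__main__":
    # A 10 MB read starting 60 MB into the file crosses a chunk boundary.
    mb = 1024 * 1024
    for piece in chunk_requests(60 * mb, 10 * mb):
        print(piece)
```

Each piece corresponds to one (file name, chunk index) lookup at the master followed by one (chunk handle, byte range) request to a chunk server.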


Q.24. Explain the basic building blocks of Hadoop with a
[R.G.P.V., May 2019 (

Ans. On a fully configured cluster, "running Hadoop" means running a set of
daemons, or resident programs, on the different servers in your network.
These daemons have specific roles; some exist only on one server, some exist
across multiple servers. The daemons include -

(i) NameNode (ii) DataNode

(iii) Secondary NameNode (iv) JobTracker

(v) TaskTracker.

(i) NameNode - Hadoop employs a master/slave architecture for
both distributed storage and distributed computation. The distributed storage
system is called the Hadoop Distributed File System, or HDFS. The NameNode
is the master of HDFS that directs the slave DataNode daemons to perform
the low-level I/O tasks. The NameNode is the bookkeeper of HDFS; it keeps
track of how your files are broken down into file blocks, which nodes store
those blocks, and the overall health of the distributed file system. The server
hosting the NameNode typically doesn't store any user data or perform any
computations for a MapReduce program. The negative aspect of the NameNode
is that if the NameNode fails then the entire Hadoop cluster will fail.


(ii) DataNode - Each slave machine in an HDFS cluster will host a DataNode
daemon to perform the reading and writing of HDFS blocks to actual files on the
local file system. When we want to read or write an HDFS file, the file is broken
into blocks and the NameNode will tell your client which DataNode each block
resides in. Your client communicates directly with the DataNode daemons to
process the local files corresponding to the blocks. A DataNode may communicate
with other DataNodes to replicate its data blocks for redundancy. Fig. 2.12
illustrates the roles of the NameNode and DataNodes.

The figure shows two data files, /user/chuck/data1 and
/user/james/data2. The data1 file takes up three blocks, which we denote 1, 2,
and 3, and the data2 file consists of blocks 4 and 5. The content of the files is
distributed among the DataNodes. In the fig. 2.12 each block has three
replicas to ensure that if any one DataNode crashes or becomes inaccessible
over the network, we will still be able to read the files. DataNodes are constantly
reporting to the NameNode. The DataNodes continually communicate with
the NameNode to provide information regarding local changes as well as receive
instructions to create, move, or delete blocks from the local disk.

Fig. 2.12


(iii) Secondary NameNode - The Secondary NameNode (SNN) is
an assistant daemon for monitoring the state of the cluster HDFS. Each cluster
has one SNN. The SNN communicates with the NameNode to take snapshots
of the HDFS metadata at intervals defined by the cluster configuration. The
NameNode is a single point of failure for a Hadoop cluster, and the SNN
snapshots help minimize the downtime and loss of data. Nevertheless, a NameNode
failure requires human involvement to reconfigure the cluster to use the SNN as
the primary NameNode.

Fig. 2.13 Secondary NameNodes

(iv) JobTracker - The JobTracker daemon is the link between your
application and Hadoop. Once we submit our code to the cluster, the JobTracker
determines the execution plan by determining which files to process, assigns
nodes to different tasks, and monitors all tasks as they are running. If a task fails,
the JobTracker will automatically re-launch the task, possibly on a different
node, up to a predefined limit of retries. There is only one JobTracker daemon
per Hadoop cluster. It is typically run on a server as a master node of the cluster.

(v) TaskTracker - Just like the storage daemons, the computing
daemons also follow a master/slave architecture: the JobTracker is the master
overseeing the overall execution of a MapReduce job and the TaskTrackers
manage the execution of individual tasks on each slave node. Fig. 2.14
illustrates this interaction.

Fig. 2.14

One responsibility of the TaskTracker is to constantly communicate with the
JobTracker. If the JobTracker fails to receive a heartbeat from a TaskTracker
within a specified amount of time, it will assume the TaskTracker has crashed and
will resubmit the corresponding tasks to other nodes in the cluster. The following
figure shows the topology of one typical Hadoop cluster.

Fig. 2.15

This topology features a master node running the NameNode and
JobTracker daemons and a standalone node with the SNN in case the master
node fails. The slave machines each host a DataNode and TaskTracker, for
running tasks on the same node where their data is stored.


Q.25. Explain the Hadoop distributed file system selecting appropriate
Hadoop modes.

Ans. A Hadoop cluster has following three types of modes -
(i) Local (standalone) mode (ii) Pseudo-distributed mode
(iii) Fully distributed mode

(i) Local/Standalone Mode - After downloading Hadoop in your
system, by default, it is configured in a standalone mode and can be run as a
single java process.
(a) When we first uncompress the Hadoop source package,
it does not consider our hardware setup. Hadoop chooses to be conservative
and assumes a minimal configuration.
(b) All three XML files (or hadoop-site.xml before version 0.20)
are empty under this default mode -

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
</configuration>

(c) Its primary use is for developing and debugging the
application logic of a MapReduce program without the additional complexity
of interacting with the daemons.

(ii) Pseudo-distributed Mode - It is a distributed simulation on a single
machine. Each Hadoop daemon such as HDFS, YARN, MapReduce etc.
runs as a separate java process. This mode is useful for development.

(a) This mode allows us to examine memory usage, HDFS
input/output issues, and other daemon interactions.

(b) Listing 2.1 provides simple XML files to configure a single
server in this mode.

Listing 2.1 Example of the three configuration files for pseudo-
distributed mode


core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
    <description>The name of the default file system. A URI whose
    scheme and authority determine the FileSystem implementation.
    </description>
  </property>
</configuration>

mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
    <description>The host and port that the MapReduce job tracker runs
    at.</description>
  </property>
</configuration>

hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>The actual number of replications can be specified
    when the file is created.</description>
  </property>
</configuration>
(c) In core-site.xml and mapred-site.xml we specify the hostname
and port of the NameNode and the JobTracker, respectively.
(d) In hdfs-site.xml we specify the default replication factor for HDFS.
(e) We must also specify the location of the Secondary
NameNode in the masters file and the slave nodes in the slaves file -

[hadoop-user@master]$ cat masters
localhost
[hadoop-user@master]$ cat slaves
localhost

(f) While all the daemons are running on the same machine,
they still communicate with each other using the same SSH protocol as if they
were distributed over a cluster.
(g) We need to check that the machine allows you to ssh back to itself.

[hadoop-user@master]$ ssh localhost

(h) If the above command results in an error then set up ssh by
using the following commands

[hadoop-user@master]$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
[hadoop-user@master]$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

(i) Next, we need to format the namenode by using the following
command

[hadoop-user@master]$ bin/hadoop namenode -format

(j) To launch the daemons, use the start-all.sh script. The
Java jps command will list all daemons to verify the setup was successful.

[hadoop-user@master]$ bin/start-all.sh
[hadoop-user@master]$ jps
26893 Jps
26832 TaskTracker
26620 SecondaryNameNode
26333 NameNode
26484 DataNode
26703 JobTracker

(k) We can shut down all the daemons using the command

[hadoop-user@master]$ bin/stop-all.sh
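
The property files in Listing 2.1 are plain XML, so their settings can be inspected with any XML parser. The following is a small sketch using Python's standard library; the embedded document is a trimmed copy of the core-site.xml example, and `read_properties` is a hypothetical helper, not part of Hadoop.

```python
import xml.etree.ElementTree as ET

CORE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>"""


def read_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config document."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name").strip(): prop.findtext("value").strip()
        for prop in root.findall("property")
    }


if __name__ == "__main__":
    for name, value in read_properties(CORE_SITE).items():
        print(f"{name} = {value}")
```

A check like this can be handy for confirming which NameNode address or replication factor a node is actually configured with before the daemons are started.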


(iii) Fully Distributed Mode -
(a) An actual Hadoop cluster runs in the third mode, the fully
distributed mode, which emphasizes the benefits of distributed storage and
distributed computation.
(b) In the discussion below we will use the following server
names -
(1) master - The master node of the cluster and host of the
NameNode and JobTracker daemons.
(2) backup - The server that hosts the Secondary NameNode
daemon.
(3) hadoop1, hadoop2, hadoop3 - The slave boxes of the
cluster running both DataNode and TaskTracker daemons.

Q.26. What is parallel and distributed processing ? Explain with example.
[R.G.P.V., May 2019 (VIII-Sem.)]


Ans. Parallel Data Processing - Parallel data processing involves the
simultaneous execution of multiple sub-tasks that collectively comprise a larger
task. The goal is to reduce the execution time by dividing a single larger task
into multiple smaller tasks that run concurrently.

Although parallel data processing can be achieved through multiple
networked machines, it is more typically achieved within the confines of a
single machine with multiple processors or cores. A task can be divided into
sub-tasks that are executed on different processors of the same machine, as in
fig. 2.16.

Fig. 2.16 An Example of Parallel Data Processing

Distributed Data Processing - Distributed data processing is closely
related to parallel data processing in that the same principle of "divide-and-
conquer" is applied. However, distributed data processing is always achieved
through physically separate machines that are networked together as a cluster.
In fig. 2.17, a task is divided into three sub-tasks that are then executed on
three different machines sharing one physical switch.

Fig. 2.17 An Example of Distributed Data Processing


Q.27. What is YARN ? Explain its architecture with the help of diagram.

Ans. To overcome limitations of MapReduce the designers have put forward
the next generation of MapReduce, YARN. The main purpose of YARN is to
divide the tasks of the JobTracker. In YARN, the resources are managed by the
ResourceManager, and the ApplicationMaster cooperates with
the NodeManager to complete the tasks.


YARN Architecture - Compared with the old MapReduce architecture,
YARN is more structured and simple. Its components are as follows -

(i) ResourceManager - According to the different functions of the
ResourceManager, the designers have divided it into two lower-level components,
viz., the Scheduler and the ApplicationManager. On the one hand, the Scheduler
allocates resources to the running applications; it is not responsible for
monitoring the application implementation and task failure. On the other hand,
the ApplicationManager is in charge of accepting the submitted applications.

(ii) NodeManager - The NodeManager is the framework agent on each
node. It is responsible for launching the application containers, monitoring the
usage of the resources, and reporting all the information to the Scheduler.

(iii) ApplicationMaster - The ApplicationMaster cooperates with
the NodeManager to put tasks in suitable containers, run the tasks and
monitor them. When a container has errors, the ApplicationMaster will
apply for another resource from the Scheduler to continue the process.

(iv) Container - In YARN, the container is the resource unit into which
the resources available on a node are split. Instead of the Map and
Reduce resource pools in MapReduce, the ApplicationMaster can apply for any
number of containers. Because all containers have the same properties, they
can be exchanged in the task execution to improve efficiency.


Q.28. What are the advantages of YARN ?

Ans. There are four main advantages of YARN compared to
MapReduce -

(i) YARN greatly enhances the scalability and availability of the cluster
by distributing the tasks of the JobTracker. The ResourceManager and the
ApplicationMaster greatly relieve the bottleneck of the JobTracker and the
safety problems in MapReduce.

(ii) In YARN, the ApplicationMaster is a customized component. That
means that the users can write their own program based on the programming
model. This makes YARN more flexible and suitable for wide use.

(iii) YARN, on the one hand, supports the program to have a specific
checkpoint. It can ensure that the ApplicationMaster can reboot immediately
based on the status which was stored on HDFS. On the other hand, it uses
ZooKeeper on the ResourceManager to implement the failover. When the
ResourceManager receives errors, the backup ResourceManager takes over.
These two measures improve the availability of YARN.

(iv) The cluster has the same containers instead of the Reduce and
Map pools in MapReduce. Once there is a request for resources, the Scheduler
will assign the available resources in the cluster to the tasks regardless of the
resource type. This increases the utilization of the cluster resources.


Q.29. Define MapReduce. What is the role of map function and reduce
function ?

Ans. MapReduce is widely used in logs analysis, data sorting, and specific
data searching. MapReduce is the core technology of Hadoop. It provides a
parallel computing model for the big data and supports a set of programming
interfaces for the developers.

MapReduce is a standard functional programming model. This kind of
model has been used in early programming languages, such as Lisp. The
core of the calculation model is that a function can be passed as a parameter to
another function. Through multiple concatenations of functions, the data
processing can turn into a series of function executions. MapReduce has two
stages of processing. The first one is Map and the other one is Reduce. The
reason why MapReduce is popular is that it is very simple, easy to
implement, and offers strong expansibility. MapReduce is suitable for processing
big data because the data can be processed by multiple hosts at the same time
to gain a faster speed.

Map Function - Each map function receives the input data split as a set of
(key, value) pairs to process and produces the intermediate (key, value) pairs.


Reduce Function - The reduce worker iterates over the grouped (key,
value) pairs, and for each unique key, it sends the key and corresponding
values to the Reduce function. Then this function processes its input data and
stores the output results in predetermined files in the user's program.


Q.30. Write short note on employing Hadoop MapReduce. Also describe
its features and applications.


Ans. A distributed data processing framework is called MapReduce. In
other words, MapReduce is a framework for processing parallelizable problems
across large datasets using a large number of computers, collectively referred
to as a cluster or a grid. Processing can occur on data stored either in a
filesystem (unstructured) or in a database (structured).

The features of Hadoop MapReduce are as follows -

(i) The programming model is simple yet expressive. A large number
of tasks can be expressed as MapReduce jobs. The model is independent of
the underlying storage system and is able to process both structured and
unstructured data.
(ii) It achieves scalability through block-level scheduling. The runtime
system automatically splits the input data into even-sized blocks and
dynamically schedules the data blocks to the available nodes for processing.
(iii) It offers fault tolerance whereby only tasks on failed nodes have
to be restarted.

The applications of Hadoop MapReduce include the following -

(v) Processing satellite image data
(vi) Language model processing for statistical machine translation
(vii) Large-scale graph computations
(viii) Index building for various search operations
(ix) Spam detection
(x) Various data mining applications.

Q.31. Discuss in detail about Hadoop MapReduce. 


Ans. A Hadoop MapReduce job mainly consists of two user-defined
functions - Map and Reduce. The input of a Hadoop MapReduce job is a set
of key-value pairs (k, v), and the map function is called for each of these pairs.
The map function produces zero or more intermediate key-value pairs (k', v').
Then, the Hadoop MapReduce framework groups these intermediate key-
value pairs by intermediate key k' and calls the reduce function for each group.
Finally, the reduce function produces zero or more aggregated results.
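
The map, group and reduce sequence described above can be sketched in a few lines of plain Python. This is only an illustrative single-process model, not the Hadoop API; the names run_mapreduce, word_count_map and word_count_reduce, and the sample lines, are assumptions made for the example.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Tiny single-process model of a MapReduce job (not the Hadoop API)."""
    # Map phase: call map_fn once per input (k, v) pair.
    intermediate = []
    for k, v in records:
        intermediate.extend(map_fn(k, v))

    # Group phase: collect all intermediate values under their key k'.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # Reduce phase: call reduce_fn once per group, in key order.
    results = []
    for k2 in sorted(groups):
        results.extend(reduce_fn(k2, groups[k2]))
    return results


def word_count_map(offset, line):
    # Emits zero or more (word, 1) pairs for one input line.
    return [(word.lower(), 1) for word in line.split()]


def word_count_reduce(word, counts):
    # Emits one aggregated (word, total) pair for the group.
    return [(word, sum(counts))]


# Hypothetical input: (byte offset, line text) pairs.
lines = [(0, "big data needs big clusters"), (28, "data moves to clusters")]
word_counts = run_mapreduce(lines, word_count_map, word_count_reduce)
print(word_counts)
```

Both user functions return lists, which mirrors the "zero or more" wording: an empty list is a valid result for either of them.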


The term MapReduce actually refers to two separate and distinct tasks
that Hadoop programs perform. They are carried out by Mappers and Reducers -


Map Job - The first is the map job, which takes a set of data and converts
it into another set of data, where individual elements are broken down into
tuples (key/value pairs).

The map or mapper's job is to process the input data. Generally the input
data is in the form of file or directory and is stored in the Hadoop file system
(HDFS). The input file is passed to the mapper function line by line. The
mapper processes the data and creates several small chunks of data.

The map function produces zero or more intermediate key-value pairs
(key', value'). The map function takes one pair of data with a type in one data
domain, and returns a list of pairs in a different domain. From all lists, the
MapReduce framework collects all pairs with the same key and
groups them together, creating one group for each key.


Map(k1, v1) → list(k2, v2)


Reduce Job - The second is the reduce job, which takes the output of a
map as input and combines those data tuples into a smaller set of tuples.
The Reduce function is then applied in parallel to each group, which in turn
produces a collection of values in the same domain.


Reduce(k2, list(v2)) → list(v3)


The Reduce stage is the combination of the Shuffle stage and the Reduce stage.

The Reducer's job is to process the data that comes from the mapper. After
processing, it produces a new set of output, which will be stored in the HDFS.

5 Step Process of MapReduce -

Step 1 - Prepare the Map( ) Input - Set of key-value pairs (k, v).

Step 2 - Run the User-provided Map( ) Code - Generate intermediate
key-value pairs (key', value') and lists: Map(k1, v1) → list(k2, v2).

Step 3 - "Shuffle" the Map Output to the Reduce Processors - The
MapReduce system designates Reduce processors and assigns the k2 key value
each processor should work on. That is, worker nodes redistribute data based
on the output keys (k2) such that all data belonging to one key is located on
the same worker node.

Step 4 - Run the User-provided Reduce( ) Code - Reduce( ) is run
exactly once for each k2 key value produced by the Map step.

Step 5 - Produce the Final Output - The MapReduce system collects
all the Reduce output, and sorts it by k2 to produce the final outcome.
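
The five steps can be traced with a small plain-Python sketch. It is a single-process illustration under stated assumptions, not Hadoop code: NUM_REDUCERS, reducer_for and the city/temperature readings are hypothetical, and a CRC-32 checksum stands in for whatever hash the real system uses to pick a Reduce processor.

```python
from collections import defaultdict
import zlib

NUM_REDUCERS = 2  # assumed number of Reduce processors


def map_fn(filename, line):
    # Step 2: user-provided Map() turns one input pair into (k2, v2) pairs.
    city, temp = line.split(",")
    return [(city.strip(), int(temp))]


def reduce_fn(city, temps):
    # Step 4: user-provided Reduce() is called once per k2 key.
    return [(city, max(temps))]


def reducer_for(key):
    # A stable hash, so that one key always maps to the same reducer.
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS


def five_step_job(inputs):
    # Step 1: prepare the Map() input as a set of (k1, v1) pairs.
    map_input = list(inputs)

    # Step 2: run Map() on every pair.
    mapped = [pair for k1, v1 in map_input for pair in map_fn(k1, v1)]

    # Step 3: shuffle, so all values of one k2 land on one reducer.
    reducers = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for k2, v2 in mapped:
        reducers[reducer_for(k2)][k2].append(v2)

    # Step 4: run Reduce() exactly once per k2 on each reducer.
    reduced = []
    for groups in reducers:
        for k2, values in groups.items():
            reduced.extend(reduce_fn(k2, values))

    # Step 5: collect all Reduce output and sort it by k2.
    return sorted(reduced)


readings = [("f1", "Delhi, 24"), ("f1", "Chennai, 33"),
            ("f2", "Delhi, 27"), ("f2", "Chennai, 38")]
max_temps = five_step_job(readings)
print(max_temps)
```

Because every occurrence of a key is routed by the same hash, the result does not depend on how many reducers are configured; only the distribution of work changes.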


As the sequence of the name MapReduce implies, the reduce job is always
performed after the map job. Fig. 2.19 shows the MapReduce working process.


[Figure: a master node assigns map tasks and reduce tasks to workers; data flows
from input files through map tasks to intermediate files, and then through
reduce tasks to output files.]

Fig. 2.19 MapReduce Working Process


Q.32. Write short note on uses of MapReduce.
Ans. Uses of MapReduce are as follows -
At Google -
(i) Index building for Google Search
(ii) Article clustering for Google News
(iii) Statistical machine translation.
At Yahoo! -
(i) Index building for Yahoo! Search
(ii) Spam detection for Yahoo! Mail.
At Facebook -
(i) Ad optimization
(ii) Spam detection.


Q.33. Give the limitations of MapReduce.


Ans. There are four main limitations of MapReduce -


(i) The Bottleneck of JobTracker - The JobTracker should be
responsible for job allocation, management, and scheduling. It should also
communicate with all the nodes to know the processing status. It is obvious
that the JobTracker, which is unique in MapReduce, performs too many
tasks. If the number of clusters and the submitted jobs increase rapidly, it
will cause heavy network bandwidth consumption. As a result, the JobTracker
will become a bottleneck, and this is the core risk of MapReduce.


(ii) Node Failure - Because the job allocation information is too
simple, the JobTracker might assign several tasks that need more resources or
a long execution time to the same node. In this situation, it will cause
node failure or slow down the processing speed.


(iii) Jobs Delay - Before MapReduce starts to work, the
TaskTracker will report its own resources and operation situation. According
to the report, the JobTracker will assign the jobs and then the TaskTracker
starts to run. As a consequence, the communication delay may make the
JobTracker wait too long, so that the jobs cannot be completed in time.


(iv) Inflexible Framework - Although MapReduce currently allows
users to define their own functions for the different processing stages, the
MapReduce framework still limits the programming model and the resource
allocation.


Q.34. Explain the overview of MapReduce execution in Hadoop with
the help of example.

Ans. The map tasks are distributed across multiple machines by
automatically partitioning the input data into a set of M splits. The input splits can
be processed in parallel by different machines. Reduce tasks are distributed by
partitioning the intermediate key space into R pieces using a partitioning function
(e.g., hash(key) mod R). The number of partitions (R) and the partitioning
function are specified by the user.

When the user program calls the MapReduce( ) function, the following occurs -

(i) The MapReduce library splits the input files into M pieces (usually
16-64 MB per piece) and starts up many copies of the program on a cluster of
machines.

(ii) One of the copies of the program is the master. The rest are workers
that are assigned work by the master. There are
M map tasks and R reduce tasks to assign. The master picks the idle
workers and assigns each one either a map or a reduce task.

(iii) A worker assigned with a map task reads the corresponding
input split. It parses key/value pairs out of the input data and passes each pair
to the user-defined map function. The intermediate key/value pairs produced
by the function are buffered in memory.

(iv) Periodically, these buffered pairs are written to the local disk
and partitioned into R regions by the partitioning function. The locations of 
these pairs are passed back to the master who is responsible for forwarding 
these locations to the reduce workers. 

(v) When a reduce worker is notified about these locations, it uses remote
procedure calls (RPCs) to read the buffered data from the disks of the map workers. 
When a reduce worker has read all intermediate data for its partition, it sorts it by the 
intermediate keys to group together all occurrences of the same key. If the amount 
of intermediate data is too large to fit in the memory, an external sort is used. 

(vi) The reduce worker iterates over the sorted intermediate data 
and for each unique intermediate key, it passes the key and the corresponding 
set of intermediate values to the user’s reduce function. The output of the 
reduce function is appended to a final output file for this reduce partition. 

(vii) When all map and reduce tasks have completed, the master 
wakes up the user program. At this point, the MapReduce call in the user 
program returns back to the user code. After successful completion, the output 

of the MapReduce execution is available in the R output files. 
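
Steps (i) and (ii) above, splitting the input into M pieces and handing tasks to idle workers, can be sketched as follows. This is a simplified plain-Python illustration, not the actual library code: plan_splits and assign_tasks are hypothetical names, 64 MB is simply the upper end of the quoted 16-64 MB range, and giving out map tasks before reduce tasks is an assumption of the sketch.

```python
import math


def plan_splits(file_size_mb, split_size_mb=64):
    """Return (start, end) MB ranges for the M input splits of one file."""
    m = math.ceil(file_size_mb / split_size_mb)
    return [(i * split_size_mb, min((i + 1) * split_size_mb, file_size_mb))
            for i in range(m)]


def assign_tasks(num_map, num_reduce, idle_workers):
    """Master hands each idle worker one map or reduce task, maps first."""
    tasks = [("map", i) for i in range(num_map)]
    tasks += [("reduce", j) for j in range(num_reduce)]
    # zip stops at the shorter list, so surplus tasks wait for a free worker.
    assignment = dict(zip(idle_workers, tasks))
    pending = tasks[len(assignment):]
    return assignment, pending


splits = plan_splits(200)  # a hypothetical 200 MB input file
assignment, pending = assign_tasks(len(splits), 2, ["w1", "w2", "w3"])
print(splits)
print(assignment)
print(pending)
```

With three idle workers and six tasks, three tasks stay pending until a worker reports back, which is the situation the master handles by picking idle workers one at a time.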

To detect failure, the master pings every worker periodically. If no response
is received from a worker in a certain amount of time, the master marks the
worker as failed. Any map tasks completed by the worker are reset back to
their initial idle state, and therefore become eligible for scheduling on other
workers. Similarly, any map task or reduce task in progress on a failed worker
is also reset to idle and becomes eligible for rescheduling.

Completed map tasks are re-executed when failure occurs because their
output is stored on the local disk(s) of the failed machine and is therefore
inaccessible. Completed reduce tasks do not need to be re-executed since
their output is stored in a global file system.
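
The rescheduling rules for a failed worker can be written down as a short plain-Python sketch. The task records, state names and the function handle_worker_failure are hypothetical; the sketch only encodes the rule stated above (in-progress tasks and completed map tasks are reset, completed reduce tasks are kept).

```python
IDLE, IN_PROGRESS, COMPLETED = "idle", "in_progress", "completed"


def handle_worker_failure(tasks, failed_worker):
    """Reset the tasks of a failed worker and return the ids that were reset.

    tasks: list of dicts with the keys 'id', 'kind', 'state' and 'worker'.
    """
    rescheduled = []
    for task in tasks:
        if task["worker"] != failed_worker:
            continue
        # Completed map output lives on the failed machine's local disk.
        lost_map_output = task["kind"] == "map" and task["state"] == COMPLETED
        if task["state"] == IN_PROGRESS or lost_map_output:
            task["state"], task["worker"] = IDLE, None
            rescheduled.append(task["id"])
    return rescheduled


tasks = [
    {"id": "m0", "kind": "map", "state": COMPLETED, "worker": "w1"},
    {"id": "m1", "kind": "map", "state": IN_PROGRESS, "worker": "w2"},
    {"id": "r0", "kind": "reduce", "state": COMPLETED, "worker": "w1"},
    {"id": "r1", "kind": "reduce", "state": IN_PROGRESS, "worker": "w1"},
]
reset_ids = handle_worker_failure(tasks, "w1")
print(reset_ids)
```

The completed reduce task r0 is left untouched because its output is already in the global file system.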
Example of a MapReduce - Assume we have five files, and each file
contains two columns, a key and a value in Hadoop terms, that represent a city
and the corresponding temperature recorded in that city for the various
measurement days. This example is made very simple so it's easy to follow. You
can imagine that a real application contains millions or even billions of rows.
The data in a file looks as follows -

Chennai, 33
Calcutta, 32
Delhi, 24
Calcutta, 34
Chennai, 38
Delhi, 27
Calcutta, 33
Chennai, 37

Suppose that, out of all the data collected, we have to find the maximum
temperature for each city across all of the data files (note that each file might
have the same city represented multiple times). Using the MapReduce framework,
we can break this down into five map tasks, where each mapper works on one of
the five files and the mapper task goes through the data and returns the maximum
temperature for each city. The results produced from one mapper task for the
above data would look like this -

(Delhi, 31)
(Mumbai, 32)
(Chennai, 38)
(Calcutta, 34)


Q.35. Explain the architecture of MapReduce in Hadoop.


Ans. The Hadoop MapReduce MRv1 framework is based on a centralized
master/slave architecture. The architecture utilizes a single master server
(JobTracker) and several slave servers (TaskTrackers). The JobTracker
represents a centralized program that keeps track of the slave nodes, and
provides an interface infrastructure for job submission. The TaskTracker
executes on each of the slave nodes where the actual data is normally
stored. Users submit MapReduce jobs to the JobTracker, which inserts
the jobs into the pending jobs queue and executes them (normally) on a FIFO
basis (it has to be pointed out that other job schedulers are available, see
Hadoop schedulers below). The JobTracker manages the map and reduce task
assignments with the TaskTrackers. The TaskTrackers execute the jobs based on
the instructions from the JobTracker and handle the data movement between the
map and reduce phases, respectively. Any map/reduce construct basically
implements a form of a Directed Acyclic Graph (DAG). A DAG can execute
anywhere in parallel, as long as one entity is not an ancestor of another entity.
In other words, parallelism is achieved when there are no hidden dependencies
among shared states. In the MapReduce model, the internal organization is based
on the map function that transforms a piece of data into entities of [key, value]
pairs. Each of these elements is sorted (via their key) and ultimately reaches
the same cluster node, where a reduce function is used to merge the values
(with the same key) into a single result (see code below). The Map/Reduce
DAG is organized as depicted in fig. 2.20.

Map Function
map(input_record) {
    ...
    emit(k1, v1)
    ...
    emit(k2, v2)
    ...
}

Reduce Function
reduce(key, values) {
    while (values.has_next) {
        aggregate = merge(values.next)
    }
    collect(key, aggregate)
}

Input Data -> Split [k1, v1] -> Sort by k1 -> Merge [k1, [v1, v2, ...]]

Fig. 2.20 The MapReduce DAG

The Hadoop MapReduce framework is based on a pull model, where multiple
TaskTrackers communicate with the JobTracker requesting tasks (either map
or reduce tasks). After an initial setup phase, the JobTracker is informed about
a job submission. The JobTracker provides a job ID to the client program, and
starts allocating map tasks to idle TaskTrackers requesting work items (see fig.
2.21). Each TaskTracker contains a defined number of task slots based on the
capacity potential of the system. Via the heartbeat protocol, the JobTracker
knows the number of free slots in the TaskTracker (the TaskTrackers send
heartbeat messages indicating the free slots; this is true for the FIFO scheduler).
Hence, the JobTracker can determine the appropriate job setup for a TaskTracker
based on the actual availability behaviour. The assigned TaskTracker will fork a
MapTask to execute the map processing cycle (the MapReduce framework
spawns 1 MapTask for each InputSplit generated by the InputFormat). In other
words, the MapTask extracts the input data from the splits by using the
RecordReader and InputFormat for the job, and it invokes the user provided
map function, which emits a number of [key, value] pairs in the memory buffer.


After the MapTask has finished executing all input records, the commit process
cycle is initiated by flushing the memory buffer to the index and data file pair.
The next step consists of merging all the index and data file pairs into a single
construct that is (once again) being divided up into local directories. As some
map tasks are completed, the JobTracker starts initiating the reduce tasks phase.
The TaskTrackers involved in this step download the completed files from the
map task nodes, and basically concatenate the files into a single entity. As more
map tasks are being completed, the JobTracker notifies the involved
TaskTrackers, requesting the download of the additional region files and to
merge the files with the previous target file. Based on this design, the process
of downloading the region files is interleaved with the on-going map task
procedures.

Eventually, all the map tasks will be completed, at which point the
JobTracker notifies the involved TaskTrackers to proceed with the reduce
phase. Each TaskTracker will fork a ReduceTask (separate JVMs are used),
read the downloaded file (that is already sorted by key), and invoke the reduce
function that assembles the key and aggregated value structure into the final
output file (there is one file per reducer node). Each reduce task (or map task)
is single threaded, and this thread invokes the reduce [key, values] function in
either ascending or descending order. The output of each reducer task is
written to a temp file in HDFS. When the reducer finishes processing all keys,
the temp file is automatically renamed into its final destination file name.


As the MapReduce library is designed to process vast amounts of data by
potentially utilizing hundreds or thousands of nodes, the library has to be able
to gracefully handle any failure scenarios. The TaskTracker nodes periodically
report their status to the JobTracker that oversees the overall job progress. In
scenarios where the JobTracker has not been contacted by a TaskTracker for
a certain amount of time, the JobTracker assumes a TaskTracker node failure
and hence, reassigns the tasks to other available TaskTracker nodes. As the
results of the map phase are stored locally, the data will no longer be available
if a TaskTracker node goes offline.


In such a scenario, all the map tasks from the failed node (regardless of the
actual completion percentage) will have to be reassigned to a different
TaskTracker node that will re-execute all the newly assigned splits. The results
of the reduce phase are stored in HDFS and hence, the data is globally available
even if a TaskTracker node goes offline. Hence, in a scenario where a
TaskTracker node goes offline during the reduce phase, only the set of incomplete
reduce tasks has to be reassigned to a different TaskTracker node.
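
The heartbeat-based failure detection described above can be illustrated with a small plain-Python sketch. The function names, the tracker names and the 600-second timeout are assumptions chosen for the example, and the round-robin reassignment is only one possible policy, not the scheduler Hadoop actually uses.

```python
def find_failed_trackers(last_heartbeat, now, timeout_s=600):
    """Names of TaskTrackers whose last heartbeat is older than the timeout."""
    return sorted(name for name, seen in last_heartbeat.items()
                  if now - seen > timeout_s)


def reassign_tasks(assignments, failed, healthy):
    """Move every task owned by a failed tracker to a healthy one (round robin)."""
    if not healthy:
        raise ValueError("no healthy TaskTracker available")
    moved = {}
    next_slot = 0
    for task, tracker in assignments.items():
        if tracker in failed:
            moved[task] = healthy[next_slot % len(healthy)]
            next_slot += 1
        else:
            moved[task] = tracker
    return moved


# Hypothetical timestamps (in seconds) of the last heartbeat per tracker.
heartbeats = {"tt1": 1000, "tt2": 350, "tt3": 990}
failed = find_failed_trackers(heartbeats, now=1010, timeout_s=600)
healthy = [name for name in sorted(heartbeats) if name not in failed]
new_plan = reassign_tasks(
    {"map-0": "tt1", "map-1": "tt2", "map-2": "tt2"}, failed, healthy)
print(failed)
print(new_plan)
```

Only tt2 has been silent for longer than the timeout, so its two map tasks are spread over the remaining trackers while map-0 stays where it is.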


Q.36. Explain the dataflow and control flow of MapReduce.

Ans. MapReduce is the heart of Hadoop. It is a programming model designed
for processing large volumes of data in parallel by dividing the work into a set of
independent tasks. The framework possesses the feature of data locality. Data
locality means movement of the algorithm to the data instead of data to the
algorithm. When the processing is done on the data, the algorithm is moved across
the DataNodes rather than the data to the algorithm. The architecture is so
constructed because moving computation is cheaper than moving data.

It is fault tolerant, which is achieved by its daemons using the concept of
replication. The daemons associated with the MapReduce phase are the
JobTracker and the TaskTrackers. The JobTracker pushes work out to the
available TaskTracker nodes, striving to keep the work as close to the data as
possible.

MapReduce has a simple model of data processing — Inputs and outputs 
for the map and reduce functions are key-value pairs. The map and reduce 
functions in Hadoop MapReduce have the following general form — 

map : (K1, V1) → list(K2, V2)

reduce : (K2, list(V2)) → list(K3, V3)

Before processing, the framework needs to know which data to process; this
is achieved with the InputFormat class. InputFormat is the class which selects
the files from HDFS that should be input to the map function. After running
setup( ), the nextKeyValue( ) method is called repeatedly on the Context (which
delegates to the identically-named method on the RecordReader) to populate
the key and value objects for the mapper. The key and value are retrieved from
the RecordReader by way of the Context, and passed to the map( ) method
for it to do its work. Input to the map function, which is the key-value pair (K,
V), gets processed as per the logic mentioned in the map code.

When the reader gets to the end of the stream, the nextKeyValue() method 
returns false, and the map task runs its cleanup( ) method. 
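
The control flow of a map task (setup, repeated nextKeyValue and map calls, then cleanup) can be mimicked in plain Python. This is a hedged sketch and not the Hadoop classes themselves: LineContext and LengthMapper are hypothetical stand-ins, with the context playing the role of both the Context and the RecordReader.

```python
class LineContext:
    """Stand-in for the Context/RecordReader pair: serves (offset, line) records."""

    def __init__(self, text):
        self._records = []
        offset = 0
        for line in text.splitlines(keepends=True):
            self._records.append((offset, line.rstrip("\n")))
            offset += len(line)
        self._pos = -1
        self.output = []

    def next_key_value(self):
        # Advance to the next record; False signals the end of the stream.
        self._pos += 1
        return self._pos < len(self._records)

    def current_key(self):
        return self._records[self._pos][0]

    def current_value(self):
        return self._records[self._pos][1]

    def write(self, key, value):
        self.output.append((key, value))


class LengthMapper:
    """Toy mapper that emits (offset, line length) for every input line."""

    def __init__(self):
        self.events = []

    def setup(self, context):
        self.events.append("setup")

    def map(self, key, value, context):
        context.write(key, len(value))

    def cleanup(self, context):
        self.events.append("cleanup")

    def run(self, context):
        # Same order as described above: setup, map per record, cleanup.
        self.setup(context)
        try:
            while context.next_key_value():
                self.map(context.current_key(), context.current_value(), context)
        finally:
            self.cleanup(context)


ctx = LineContext("ab\ncde\n")
mapper = LengthMapper()
mapper.run(ctx)
print(ctx.output, mapper.events)
```

The try/finally block ensures that cleanup runs even if a map call raises an error, which is why cleanup is a safe place to release resources.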

The output of the mapper is sent to the partitioner. The partitioner controls
the partitioning of the keys of the intermediate map outputs. The key (or a
subset of the key) is used to derive the partition, typically by a hash function.
The total number of partitions is the same as the number of reduce tasks for
the job. Hence this controls which of the m reduce tasks the intermediate key
(and hence the record) is sent to for reduction. The use of partitioners is optional.
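
A hash-based partitioner of the kind described above can be sketched in plain Python. The functions hash_partition and partition_output are hypothetical, and CRC-32 is used here only because it is stable across runs; Hadoop's own default partitioner works on the key's Java hash code instead.

```python
import zlib


def hash_partition(key, num_reduce_tasks):
    """Derive the partition number of a key from a stable, non-negative hash."""
    h = zlib.crc32(str(key).encode("utf-8")) & 0x7FFFFFFF
    return h % num_reduce_tasks


def partition_output(pairs, num_reduce_tasks):
    """Distribute intermediate (key, value) pairs over the reduce partitions."""
    partitions = [[] for _ in range(num_reduce_tasks)]
    for key, value in pairs:
        partitions[hash_partition(key, num_reduce_tasks)].append((key, value))
    return partitions


pairs = [("apple", 1), ("mango", 1), ("apple", 1), ("kiwi", 1)]
parts = partition_output(pairs, 3)
for number, part in enumerate(parts):
    print(number, part)
```

Since the partition depends only on the key, both ("apple", 1) records end up in the same partition and therefore reach the same reduce task.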


Q.37. Write short note on subdividing data in preparation for Hadoop
MapReduce.

Ans. Central to the scalability of Apache Hadoop is the distributed
processing framework known as MapReduce. MapReduce helps programmers
solve data-parallel problems for which the data set can be subdivided into
small parts and processed independently. MapReduce is an important advance
because it allows ordinary developers, not just those skilled in high-performance
computing, to use parallel programming constructs without worrying about
the complex details of intra-cluster communication, task monitoring, and failure
handling. MapReduce simplifies all that.

The system splits the input data-set into multiple chunks, each of which is
assigned a map task that can process the data in parallel. Each map task reads
the input as a set of (key, value) pairs and produces a transformed set of (key,
value) pairs as the output. The framework shuffles and sorts outputs of the map
tasks, sending the intermediate (key, value) pairs to the reduce tasks, which
group them into final results. MapReduce uses JobTracker and TaskTracker
mechanisms to schedule tasks, monitor them, and restart any that fail.

The Apache Hadoop platform also includes the Hadoop Distributed File System
(HDFS), which is designed for scalability and fault-tolerance. HDFS stores large
files by dividing them into blocks (usually 64 or 128 MB) and replicating the
blocks on three or more servers. HDFS provides APIs for MapReduce applications
to read and write data in parallel. Capacity and performance can be scaled by
adding Data Nodes, and a single NameNode mechanism manages data placement
and monitors server availability. HDFS clusters in production use today reliably
hold petabytes of data on thousands of nodes.


Q.38. Discuss in detail about Map and Reduce operation ?


Ans. The Map operation applies a computation to the key/value pairs in an
input, and the Reduce operation combines all the result values computed
by the Map operation. As shown in fig. 2.22, the user program divides the
input files into different blocks, and these blocks generate a number of copies
of the program in the cluster. The cluster has one master node and several data
nodes. The data nodes are known as worker nodes and may be assigned Map or
Reduce work by the master node.

Once the user defines the input files, the master node assigns worker nodes
to the Map function. The worker nodes assigned to Map work read the data
from the different input files and write the results to local disk. Once the Map
worker nodes have finished their work by writing to local disk, another set of
worker nodes is assigned to the Reduce function. These worker nodes read the
files from local disk and write the results to the output files. In this way, the
entire process is completed in Hadoop MapReduce.

[Figure: the user program forks a master and workers; map workers read the
input files and write intermediate files to local disks (local write); reduce
workers read them (remote read) and write the output files.]

Fig. 2.22 Map and Reduce Operation


Q.39. Explain mapping data to the programming framework.


Ans. Many different higher-level programming frameworks have been
developed. The most commonly implemented programming framework is the
MapReduce framework. MapReduce is an emerging programming framework
for data-intensive applications proposed by Google. MapReduce borrows
ideas from functional programming, where the programmer defines Map
and Reduce tasks to process large sets of distributed data.

Implementations of MapReduce enable many of the most common
calculations on large-scale data to be performed on computing clusters
efficiently and in a way that is tolerant of hardware failures during computation.
However, MapReduce is not suitable for online transactions.

The key strengths of the MapReduce programming framework are
the high degree of parallelism combined with the simplicity of the
programming framework and its applicability to a large variety of application
domains. This requires dividing the workload across a large number of
machines. The degree of parallelism depends on the input data size. The
map function processes the input pairs (key1, value1), returning some other
intermediary pairs (key2, value2). Then the intermediary pairs are grouped
together according to their key. The reduce function will output some new
key-value pairs of the form (key3, value3). Fig. 2.23 shows an example of
the MapReduce algorithm used to count words in a file. In this example the
map input key is the provided data chunk with a value of 1. The map output
key is the word itself and the value is 1 every time the word exists in the
processed data chunk. The reducers perform the aggregation of the key-value
pairs output from the maps and output a single value for every key, which in
this case is a count for every word. Fig. 2.23 provides further explanation of
the generation of the key-value pairs produced during the processing phases
of the WordCount MapReduce program.

High performance is achieved by breaking the processing into small
units of work that can be run in parallel across potentially hundreds or
thousands of nodes in the cluster. Programs written in this functional style
are automatically parallelized and executed on a large cluster of commodity
machines. This allows programmers without any experience with parallel
and distributed systems to easily utilize the resources of a large
distributed system.

MapReduce programs are usually written in Java; however, they can also
be coded in languages such as C++, Perl, Python, Ruby, R, etc. These
programs may process data stored in different file and database systems.

[Fig. 2.23: word count with MapReduce, showing the key-value pairs produced
in each processing phase.]


Q.40. Write a program to process the sample data using MapReduce.


Ans.
    package hadoop;

    import java.util.*;
    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.util.*;

    public class ProcessUnits
    {
        //Mapper class
        public static class E_EMapper extends MapReduceBase implements
        Mapper<LongWritable,    /*Input key Type*/
        Text,                   /*Input value Type*/
        Text,                   /*Output key Type*/
        IntWritable>            /*Output value Type*/
        {
            //Map function
            public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output,
            Reporter reporter) throws IOException
            {
                String line = value.toString( );
                String lasttoken = null;
                StringTokenizer s = new StringTokenizer(line, "\t");
                String year = s.nextToken( );
                while(s.hasMoreTokens( ))
                {
                    lasttoken = s.nextToken( );
                }
                int avgprice = Integer.parseInt(lasttoken);
                output.collect(new Text(year), new IntWritable(avgprice));
            }
        }

        //Reducer class
        public static class E_EReduce extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, IntWritable>
        {
            //Reduce function
            public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output,
            Reporter reporter) throws IOException
            {
                int maxavg = 30;
                int val = Integer.MIN_VALUE;
                while(values.hasNext( ))
                {
                    if((val = values.next( ).get( )) > maxavg)
                    {
                        output.collect(key, new IntWritable(val));
                    }
                }
            }
        }

        //Main function
        public static void main(String args[ ]) throws Exception
        {
            JobConf conf = new JobConf(ProcessUnits.class);

            conf.setJobName("max_electricityunits");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(E_EMapper.class);
            conf.setCombinerClass(E_EReduce.class);
            conf.setReducerClass(E_EReduce.class);
            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }

Output -
Below is the output generated by the MapReduce program.

    1981    34
    1984    40
    1985    45


Q.41. Explain the wordcount program with example.


Ans. Wordcount is the Hello World program of the Hadoop world. In this
program, the words of a file are counted in such a way that it shows how
many times each word is repeated in the file. In the wordcount program, the
words are processed in the following steps -

(i) Input - In this step, the data are copied into the HDFS as an input.
These inputs contain the words in any format.

(ii) Splitting - Once the data are copied as input in the HDFS, the
data are split into different blocks depending on the sizes of the file and block.
If the size of the file is 500 MB and the size of the block is 128 MB, the file will
be split into four blocks. In these blocks, the file is arranged in such a way
that every word is separated by space, comma, etc.

(iii) Mapping - The counting of words occurs in this phase. The
words are counted and given a value for every word. In this step, the results
are arranged in the form of a word (key) and the value 1. Suppose the file has
"This is a wordcount example"; then the result in this phase will be as follows -

    This, 1
    Is, 1
    A, 1
    Wordcount, 1
    Example, 1

(iv) Shuffle - The results obtained from the mapping step are further
sorted and arranged in such a way that repeated words are put together as a
group. But still the key and value appear to be the same, and the repeated words
are listed as many times as they appear in the file. For example, if the word
"Apple" appears 3 times and "Mango" 2 times in the file, it will be arranged as -

    Apple, 1
    Apple, 1
    Apple, 1
    Mango, 1
    Mango, 1

(v) Reduce - In this step, the sorted group of words from the shuffle
phase is further sorted in such a way that the repeated words appear only once
and the number of times each word is repeated is added to the value. The
result from the shuffle phase after sorting will look like the following -

    Apple, 3
    Mango, 2

If there are more words in the file, then in the same way the rest of the
words are sorted and the values are added to the key.

(vi) Results - Finally, the sorted words from the reduce phase are
written into the output, where users can read the results.

[Figure: the wordcount data flow through the stages described above, including
the map stage.]

Fig. 2.24 Wordcount Program

This graphical representation of the entire process is shown in fig. 2.24.
As shown in fig. 2.24, the wordcount process can be separated into three
phases - Input Split, Map Task, and Reduce Task.

The example code that runs the program shown in fig. 2.24 can be written
in the Scala programming language (here using the Apache Spark API), which
is similar to Java. All of the steps involved in the above description are as
follows -

    package com.ims
    import org.apache.spark.{SparkContext, SparkConf}
    object WordCount {
      def main(args : Array[String]) {
        val conf = new SparkConf( ).setAppName("word count").setMaster("local[2]")
        val sc = new SparkContext(conf)
        System.setProperty("hadoop.home.dir", "C:\\winutil\\")
        val input = sc.textFile("data/inputs/wordcount.txt")
        val mapOutput = input.flatMap(line => line.split(" ")).map(x => (x, 1))
        mapOutput.reduceByKey(_ + _).collect.foreach(println)
        mapOutput.reduceByKey(_ + _).saveAsTextFile("data\\outpo")
      }
    }


Q.42. Write the advantages of HDFS and MapReduce.

Ans. Advantages of HDFS -
(i) It has very high bandwidth to support MapReduce jobs.
(ii) It is less expensive.
(iii) We can write the data once and read it many times.

Advantages of MapReduce -
(i) It supports a wide range of languages.
(ii) It is platform independent.


HIVE ARCHITECTURE


Q.1. What is Hive ? Write the advantages and disadvantages of Hive.


Ans. Hive was originally an internal Facebook project which eventually
matured into a full-blown Apache project, and it was created to simplify access
to MapReduce (MR) by exposing a SQL-based language for data manipulation.
Hive also maintains metadata in a metastore, which is stored in a relational
database; this metadata contains information about what tables exist, their
columns, privileges, and more. Hive is an open source data warehousing solution
built on top of Hadoop, and its particular strength is in offering ad-hoc querying
of data, in contrast to the compilation requirement of Pig and Cascading.

Hive is a natural starting point for more full-featured business intelligence
systems which offer a user friendly interface for non-technical users.

Apache Hive supports analysis of large datasets stored in Hadoop's HDFS
as well as compatible file systems like Amazon S3 (Simple Storage
Service). Amazon S3 is a scalable, high-speed, low-cost, web-based service
designed for online backup and archiving of data as well as application programs.
Hive provides an SQL-like language called HiveQL while maintaining full support
for map/reduce, and to accelerate queries, it provides indexes, including bitmap
indexes. Apache Hive is a data warehouse infrastructure built on top of Hadoop
for providing data summarization, query, as well as analysis.

Advantages - The advantages of Hive are as follows -
(i) It perfectly fits the low level interface requirement of Hadoop.
(ii) Hive supports external tables and ODBC/JDBC.
(iii) It has an intelligent optimizer.
(iv) Hive supports table-level partitioning to speed up the query.
(v) The metadata store is a big plus in the architecture that makes the
lookup easy.

Disadvantages - The disadvantages of Hive are as follows -
(i) It does not support processing of unstructured data.
(ii) Complicated jobs cannot be performed using Hive.
(iii) The output of one job cannot be used as input to query
another job.

Q.2. Explain installation and running Hive.
Ans. Hadoop must be installed on our system before installing Hive. Let
us verify the Hadoop installation using the following command —

$ hadoop version

It will display the Hadoop version information. Otherwise, first we have to
install Hadoop.

Download Hive and rename its folder to /usr/local/hive
urp@localhost$ cd /home/user/Download
urp@localhost$ mv apache-hive-0.14.0-bin /usr/local/hive
exit

You can set up the Hive environment by appending the following lines to the
~/.bashrc file —

export HADOOP_HOME=/usr/local/hadoop
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export PATH=$PATH:$HADOOP_HOME/bin

After installing Hive execute
$ cd $HIVE_HOME
$ bin/hive

Wordcount example for Hive

hive> SELECT word, count(1) AS count FROM (SELECT explode(split(sentence, ' ')) AS word FROM texttable) tempTable GROUP BY word;
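To make the word count query easier to follow, here is a minimal Python sketch of the same logic. It assumes a hypothetical list of strings standing in for the rows of texttable, and the function name word_count is only illustrative. It mirrors the split, explode and group-by steps and is not how Hive itself executes the query.

```python
from collections import Counter


def word_count(sentences):
    """Mimic SELECT word, count(1) ... GROUP BY word on a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        # split(sentence, ' ') followed by explode(): one row per word
        for word in sentence.split(" "):
            if word:  # simplification: ignore empty strings from repeated spaces
                counts[word] += 1
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical rows of texttable
    rows = ["big data big hive", "hive on hadoop"]
    print(word_count(rows))
```

Each inner loop iteration corresponds to one exploded row, and the Counter plays the role of the GROUP BY with count(1).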


Q.3. Explain architecture of Hive and its component. Also describe
features of Hive.
Or
Explain the architecture of Hive and how it is useful for data analysis.
[R.G.P.V., May 2019, (VIS

Ans. The architecture of Hive is shown in fig. 3.1 —


[Figure labels: Execution Engine; HDFS or HBASE Data Storage]

Fig. 3.1 The Architecture of Hive


This component diagram contains different units. Its components are
discussed below —
(i) User Interface — Hive is a data warehouse infrastructure software
that can create interaction between user and HDFS. The user interfaces that
Hive supports are Hive Web UI, Hive command line, and Hive HD Insight (in
Windows server).
(ii) Meta Store — Hive chooses respective database servers to store
the schema or Metadata of tables, databases, columns in a table, their data
types, and HDFS mapping.
(iii) HiveQL Process Engine — HiveQL is similar to SQL for querying
on schema info on the Metastore. It is one of the replacements of the traditional
approach for MapReduce program. Instead of writing a MapReduce program
in Java, we can write a query for the MapReduce job and process it.
(iv) Execution Engine — The conjunction part of HiveQL Process
Engine and MapReduce is Hive Execution Engine. Execution engine processes
the query and generates results as same as MapReduce results. It uses the
flavour of MapReduce.
(v) HDFS or HBASE — Hadoop distributed file system or HBASE
are the data storage techniques to store data into file system.

Features of Hive —
(i) It stores schema in a database and processed data into HDFS.
(ii) It is designed for OLAP.
(iii) It provides SQL type language for querying called HiveQL or HQL.
(iv) It is familiar, fast, and extensible.

Q.4. Explain the working process of Hive.
Ans. The working process of Hive is shown in fig. 3.2.

[Figure labels: Hive; Hadoop]

Fig. 3.2 Working of Hive
How Hive interacts with Hadoop framework is discussed below —

Step-1 Execute Query — The Hive interface such as Command Line or
Web UI sends query to Driver (any database driver such as JDBC, ODBC, etc.)
to execute.

Step-2 Get Plan — The driver takes the help of query compiler that parses
the query to check the syntax and query plan or the requirement of query.

Step-3 Get Metadata — The compiler sends metadata request to Metastore
(any database).

Step-4 Send Metadata — Metastore sends metadata as a response to the
compiler.

Step-5 Send Plan — The compiler checks the requirement and resends
the plan to the driver. At this point, the parsing and compiling of a query is
complete.

Step-6 Execute Plan — The driver sends the execute plan to the execution
engine.

Step-7 Execute Job — Internally, the process of execution job is a
MapReduce job. The execution engine sends the job to JobTracker, which is
in Name node and it assigns this job to TaskTracker, which is in Data node.
Here, the query executes MapReduce job.

Step-7.1 Metadata Ops — Meanwhile, the execution engine can execute
metadata operations with Metastore.


Step-8 Fetch Result — The execution engine receives the results from
Data nodes.

Step-9 Send Results — The execution engine sends those resultant values
to the driver.

Step-10 Send Results — The driver sends the results to Hive
Interfaces.

Q.5. Write a short note on Hive file format.

Ans. Hive supports all the Hadoop file formats, plus Thrift encoding, as
well as supporting pluggable SerDe (serializer/deserializer) classes to support
custom formats.

There are several file formats supported by Hive.

(i) TEXTFILE — TEXTFILE is the easiest to use, but the least
space efficient.
(ii) SEQUENCEFILE — SEQUENCEFILE format is more space
efficient.
(iii) MAPFILE — MAPFILE adds an index to a SEQUENCEFILE
for faster retrieval of particular records.
Hive defaults to the following record and field delimiters, all of which are
non-printable control characters and all of which can be customized. They are
shown in table 3.1.

Table 3.1

Delimiter | Description
\n | For text files, each line is a record, so the line feed character separates records.
^A ("control" A) | Separates all fields (columns). Written using the octal code \001 when explicitly specified in CREATE TABLE statements.
^B | Separates the elements in an ARRAY or STRUCT, or the key-value pairs in a MAP. Written using the octal code \002 when explicitly specified in CREATE TABLE statements.
^C | Separates the key from the corresponding value in MAP key-value pairs. Written using the octal code \003 when explicitly specified in CREATE TABLE statements.
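As an illustration of how these default delimiters nest, the following Python sketch splits one text record into a field, an array and a map. The record layout (name, subordinates, deductions) and the function name parse_record are assumptions made for the example; Hive's own SerDe classes do this work internally.

```python
FIELD_SEP = "\x01"       # ^A, octal \001: separates fields (columns)
COLLECTION_SEP = "\x02"  # ^B, octal \002: separates array elements and map entries
MAP_KEY_SEP = "\x03"     # ^C, octal \003: separates a map key from its value


def parse_record(line):
    """Split one record into (name, subordinates, deductions)."""
    name, subs_raw, deds_raw = line.rstrip("\n").split(FIELD_SEP)
    subordinates = subs_raw.split(COLLECTION_SEP) if subs_raw else []
    deductions = {}
    if deds_raw:
        for entry in deds_raw.split(COLLECTION_SEP):
            key, value = entry.split(MAP_KEY_SEP)
            deductions[key] = float(value)
    return name, subordinates, deductions


if __name__ == "__main__":
    # Hypothetical record: one line of a TEXTFILE table
    record = (
        "John Doe" + FIELD_SEP
        + "Mary Smith" + COLLECTION_SEP + "Todd Jones" + FIELD_SEP
        + "Federal Taxes" + MAP_KEY_SEP + "0.2" + COLLECTION_SEP
        + "State Taxes" + MAP_KEY_SEP + "0.05" + "\n"
    )
    print(parse_record(record))
```

The line feed ends the record, ^A separates the three columns, ^B separates items inside a collection, and ^C separates each map key from its value.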
Q.6. Discuss about the Hive data types.

Ans. Hive supports two types of data —
(i) Primitive data type
(ii) Collection data type.

(i) Primitive Data Type —
(a) TINYINT, SMALLINT, INT, BIGINT are four integer data
types with only differences in their size.
(b) FLOAT and DOUBLE are two floating point data types.
BOOLEAN is to store true or false.
(c) STRING is to store character strings. Note that, in Hive,
we do not specify length for STRING like in other databases. It is more flexible
and variable in length.
(d) TIMESTAMP can be an integer which is interpreted as seconds
since the UNIX epoch.
(e) It can also be a string interpreted according to the JDBC
date string format, i.e., YYYY-MM-DD hh:mm:ss.fffffffff. The time component
is interpreted as UTC time.
(f) BINARY is used to place raw bytes which will not be
interpreted by Hive. It is suitable for binary data.

The primitive data types are shown in table 3.2.
Table 3.2 The Primitive Data Types

Type | Description
TINYINT | 1-byte signed integer
SMALLINT | 2-byte signed integer
INT | 4-byte signed integer
BIGINT | 8-byte signed integer
FLOAT | Single precision float
DOUBLE | Double precision float (e.g., 10.8932)
BOOLEAN | Boolean true or false
STRING | Sequence of characters
TIMESTAMP | Integer, float or string values
BINARY | Array of bytes

(ii) Collection Data Types — (a) STRUCT (b) MAP (c) ARRAY.
(a) STRUCT — Analogous to a C struct or an "object". Fields
can be accessed using the "dot" notation. For example, if a column name is of
type STRUCT {first STRING; last STRING}, then the first name field can be
referenced using name.first. Example: struct('John', 'Doe')
(b) MAP — A collection of key-value tuples, where the fields are
accessed using array notation (e.g., ['key']). For example, if a column name is of
type MAP with key → value pairs 'first' → 'John' and 'last' → 'Doe', then
the last name can be referenced using name['last']. Example:
map('first', 'John', 'last', 'Doe')
(c) ARRAY — Ordered sequences of the same type that are
indexable using zero-based integers. For example, if a column name is of type
ARRAY of strings with the value ['John', 'Doe'], then the second element can
be referenced using name[1]. Example: array('John', 'Doe')
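The three access notations can be compared side by side with a small Python analogy. This is only a sketch: the class Name and the variable names are hypothetical, and Python containers merely stand in for the Hive STRUCT, MAP and ARRAY types.

```python
from dataclasses import dataclass


@dataclass
class Name:
    """Plays the role of STRUCT {first STRING; last STRING}."""
    first: str
    last: str


name_struct = Name("John", "Doe")             # like struct('John', 'Doe')
name_map = {"first": "John", "last": "Doe"}   # like map('first', 'John', 'last', 'Doe')
name_array = ["John", "Doe"]                  # like array('John', 'Doe')

if __name__ == "__main__":
    print(name_struct.first)   # dot notation, like name.first
    print(name_map["last"])    # key lookup, like name['last']
    print(name_array[1])       # zero-based index, like name[1]
```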


Q.7. What do you mean by HiveQL ?

Ans. HiveQL is the Hive query language. Hadoop is an open source
framework for the distributed processing of large amounts of data across a
cluster. It relies upon the MapReduce paradigm to reduce complex tasks into
smaller parallel tasks that can be executed concurrently across multiple machines.
However, writing MapReduce tasks on top of Hadoop for processing data is not
for everyone since it requires learning a new framework and a new programming
paradigm altogether. What is needed is an easy-to-use abstraction on top of
Hadoop that allows people not familiar with it to use its capabilities as easily.

Hive aims to solve this problem by offering an SQL-like interface, called
HiveQL, on top of Hadoop. Hive achieves this task by converting queries
written in HiveQL into MapReduce tasks that are then run across the Hadoop
cluster to fetch the desired results.

Hive is best suited for batch processing large amounts of data (such as in
data warehousing) but is not ideally suitable as a routine transactional database
because of its slow response times (it needs to fetch data from across a cluster).

A common task for which Hive is used is the processing of logs of web
servers. These logs have a regular structure and hence can be readily converted
into a format that Hive can understand and process.

Hive query language (HiveQL) supports SQL features like CREATE tables,
DROP tables, SELECT ... FROM ... WHERE clauses, joins (inner, left outer,
right outer and outer joins), Cartesian products, GROUP BY, SORT BY,
aggregations, union and many useful functions on primitive as well as complex
data types. Metadata browsing features such as list databases, tables and so on
are also provided. HiveQL does have limitations compared with traditional
RDBMS SQL. HiveQL allows creation of new tables in accordance with
partitions (each table can have one or more partitions in Hive) as well as buckets
(the data in partitions is further distributed as buckets), and allows insertion of
data in single or multiple tables but does not allow deletion or updating of data.

Q.8. Explain the working with database in Hive.

Ans. (i) We need to run the query to create the table.

(ii) Create and Show Database — They are very useful for larger
clusters with multiple teams and users, as a way of avoiding table name
collisions. It's also common to use databases to organize production tables
into logical groups. If we do not specify a database, the default database is used.

hive> CREATE DATABASE IF NOT EXISTS financials;

At any time, you can see the databases that already exist as follows —

hive> SHOW DATABASES;
output is
default
financials

hive> CREATE DATABASE human_resources;

hive> SHOW DATABASES;
output is
default
financials
human_resources

(iii) DESCRIBE Database — Shows the directory location for the
database.

hive> DESCRIBE DATABASE financials;
output is
hdfs://master-server/user/hive/warehouse/financials.db

(iv) Use Database — The USE command sets a database as our
working database, analogous to changing working directories in a filesystem.

(v) Drop Database — You can drop a database —

hive> DROP DATABASE IF EXISTS financials;

The IF EXISTS is optional and suppresses warnings if financials does
not exist.

(vi) Alter Database — We can set key-value pairs in the
DBPROPERTIES associated with a database using the ALTER DATABASE
command. No other metadata about the database can be changed, including
its name and directory location.

(a) Create Tables — The CREATE TABLE statement follows SQL
conventions, but Hive's version offers significant extensions to support a wide
range of flexibility where the data files for tables are stored, the formats used, etc.

(1) Managed Tables —
(A) The tables we have created so far are called managed
tables or sometimes called internal tables, because Hive controls the lifecycle
of their data. As we have seen, Hive stores the data for these tables in a
subdirectory under the directory defined by hive.metastore.warehouse.dir
(e.g., /user/hive/warehouse), by default.
(B) When we drop a managed table, Hive deletes the data
in the table.
(C) Managed tables are less convenient for sharing with
other tools.

(2) External Tables —

CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
exchange STRING,
symbol STRING,
ymd STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks/';

The EXTERNAL keyword tells Hive this table is external and the
LOCATION ... clause is required to tell Hive where it's located. Because it is
external, Hive does not assume it owns the data.
(3) Partitioned, Managed Tables — Partitioned tables help
to organize data in a logical fashion, such as hierarchically.

Example — HR people often run queries with WHERE clauses that restrict
the results to a particular country or to a particular first-level subdivision (e.g.,
state in the United States or province in Canada).

We have to use address.state to project the value inside the address. So, let's
partition the data first by country and then by state —

CREATE TABLE IF NOT EXISTS mydb.employees (
name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
PARTITIONED BY (country STRING, state STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Partitioning tables changes how Hive structures the data storage. If we
create this table in the mydb database, there will still be an employees directory
for the table —

hdfs://master_server/user/hive/warehouse/mydb.db/employees

LOAD DATA LOCAL INPATH '/path/to/employee.txt'
INTO TABLE employees
PARTITION (country = 'US', state = 'IL');

Once created, the partition keys (country and state, in this case) behave
like regular columns.

hive> SHOW PARTITIONS employees;
output is
OK
country=US/state=IL
Time taken : 0.45 seconds

Dropping Tables — The familiar DROP TABLE command is
supported —

DROP TABLE IF EXISTS employees;


HiveQL Data Manipulation —
Ans.

(i) Loading Data into Managed Tables — Create stocks table —

CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
exchange STRING,
symbol STRING,
ymd STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/data/stocks/';

Queries on Stock Data Set

Load the stocks —

LOAD DATA LOCAL INPATH '/path/to/employees.txt'
INTO TABLE stocks
PARTITION (exchange = 'NASDAQ', symbol = 'AAPL');

This command will first create the directory for the partition, if it doesn't
already exist, then copy the data to it.
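The directory that such a load creates follows the key=value naming seen in the SHOW PARTITIONS output for the employees table. A minimal Python sketch of that layout is given here; the helper name partition_path is hypothetical, and the warehouse path is the default location mentioned earlier rather than a value read from any real configuration.

```python
import posixpath


def partition_path(warehouse_dir, database, table, partition):
    """Build the directory used for one partition of a managed table.

    `partition` is an ordered list of (key, value) pairs, outermost key first.
    """
    parts = [f"{key}={value}" for key, value in partition]
    return posixpath.join(warehouse_dir, f"{database}.db", table, *parts)


if __name__ == "__main__":
    print(partition_path("/user/hive/warehouse", "mydb", "employees",
                         [("country", "US"), ("state", "IL")]))
```

Because each partition key becomes one directory level, a query that filters on country and state only has to read the files under the matching subdirectory.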
(ii) Inserting Data into Tables from Queries

(iii) Dynamic Partition Insert —

INSERT OVERWRITE TABLE employees
PARTITION (country, state)
SELECT ..., a.country, a.state
FROM ... a;

Hive supports the dynamic partition feature, where it can infer the partitions
to create based on query parameters.

Q.10. Explain the HiveQL queries data.
Ans. HiveQL Queries —
(i) SELECT...FROM Clauses — SELECT is the projection operator
in SQL. The FROM clause identifies from which table, view, or nested query
we select records.

Create Employees —

CREATE EXTERNAL TABLE employees (
name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/data/employees';

Load the data —

John Doe 100000.0 ["Mary Smith", "Todd Jones"] {"Federal Taxes" : 0.2, "State Taxes" : 0.05, "Insurance" : 0.1}
Mary Smith 80000.0 ["Bill King"] {"Federal Taxes" : 0.2, "State Taxes" : 0.05, "Insurance" : 0.1}
Todd Jones 70000.0 [] {"Federal Taxes" : 0.15, "State Taxes" : 0.03, "Insurance" : 0.1}
Bill King 60000.0 [] {"Federal Taxes" : 0.15, "State Taxes" : 0.03, "Insurance" : 0.1}
Boss Man 200000.0 ["John Doe", "Fred Finance"] {"Federal Taxes" : 0.3, "State Taxes" : 0.07, "Insurance" : 0.05}
Fred Finance 150000.0 ["Stacy Accountant"] {"Federal Taxes" : 0.3, "State Taxes" : 0.07, "Insurance" : 0.05}
Stacy Accountant 60000.0 [] {"Federal Taxes" : 0.15, "State Taxes" : 0.03, "Insurance" : 0.1}

Select Data —

hive> SELECT name, salary FROM employees;
output is
John Doe 100000.0
Mary Smith 80000.0
Todd Jones 70000.0
Bill King 60000.0

When you select columns that are one of the collection types, Hive uses
JSON (JavaScript Object Notation) syntax for the output. First, let's select
the subordinates, an ARRAY, where a comma-separated list surrounded with
[...] is used.

hive> SELECT name, subordinates FROM employees;

Output is
John Doe ["Mary Smith", "Todd Jones"]
Mary Smith ["Bill King"]
Todd Jones []
Bill King []

The deductions is a MAP, where the JSON representation for maps is
used, namely a comma-separated list of key : value pairs, surrounded with {...} —

hive> SELECT name, deductions FROM employees;

John Doe {"Federal Taxes" : 0.2, "State Taxes" : 0.05, "Insurance" : 0.1}
Mary Smith {"Federal Taxes" : 0.2, "State Taxes" : 0.05, "Insurance" : 0.1}
Todd Jones {"Federal Taxes" : 0.15, "State Taxes" : 0.03, "Insurance" : 0.1}
Bill King {"Federal Taxes" : 0.15, "State Taxes" : 0.03, "Insurance" : 0.1}

Finally, the address is a STRUCT, which is also written using the JSON
map format —

hive> SELECT name, address FROM employees;

John Doe {"street" : "1 Michigan Ave.", "city" : "Chicago", "state" : "IL", "zip" : 60600}
Mary Smith {"street" : "100 Ontario St.", "city" : "Chicago", "state" : "IL", "zip" : 60601}
Todd Jones {"street" : "200 Chicago Ave.", "city" : "Oak Park", "state" : "IL", "zip" : 60700}
Bill King {"street" : "300 Obscure Dr.", "city" : "Obscuria", "state" : "IL", "zip" : 60100}

Q.11. Write a short note on sub queries.

Ans. The problem with this sampling approach is that it does not take into
account the fact that the electricity usage for a given consumer on a given day is
spread over 48 rows of the table. If a sample was taken using the above methods
directly it is highly unlikely that there would be any complete days of usage for
any of the consumers, making any further useful analysis all but impossible.

However, by using this method to sample the geography table, which
contains a single row only for each household, instead of sampling the meter
readings table directly, it is possible to select a sample of unique households
based on the anonid variable values. The anonid variable in the geography table
represents the households and can be used to join the geography file to
the meter readings data.

In the query below we are selecting from the elec_all table all of the rows
associated with the anonids in bucket 3.

SELECT *
FROM elec_all
WHERE anon_id IN (SELECT anonid
FROM geog_all
TABLESAMPLE (BUCKET 3 OUT OF 150 ON anonid));

Q.12. Describe the table joins in querying.

Ans. In a relational database system, the ability to join tables together
is a key requirement. Joins are used to combine the rows from two tables
using a column in each table. The join columns need not have the same name,
but they do need to have a common usage. For example, in the geography table
there is the anonid column and in the elec or gas tables the anon_id column.
In both cases they represent an anonymised household.

There are several different types of join possible as shown in table 3.3.

Table 3.3

Join type | Result
Inner join | Matched rows in both tables are returned.
Left outer join | All rows in the left hand table are returned along with the matches from the right hand table, or NULLs if there is no match.
Right outer join | All rows in the right hand table are returned along with the matches from the left hand table, or NULLs if there is no match.
Full outer join | All rows from both tables are returned, with NULLs where there are no matches.

The inner join is the most commonly used. The outer joins can be useful
for exploring or discovering missing data. The cross join is rarely used as it
could create extremely large tables.

Examples — Using the two small tables Animals and Animal_Eats
shown in table 3.4 and table 3.5,

The Inner Join results is as shown in table 3.6.

Table 3.6

The Left Outer Join results is as shown in table 3.7.

Table 3.7

The Right Outer Join results is as shown in table 3.8.
Table 3.8

Name | Id | Id | Eats
Elephant | 1 | 1 | Hay
Cat | 3 | 3 | Fish
Dog | 4 | 4 | Meat
NULL | NULL | 6 | Goldfish food
NULL | NULL | 7 | Lettuce
Goat | 8 | 8 | Flowers
Pig | 10 | 10 | Anything

The Full Outer Join results is as shown in table 3.9.

Table 3.9

Eats
NULL
Fish
Meat
Goldfish food
Lettuce
Flowers
NULL

An example join statement —

SELECT e.anon_id, g.anon_id
FROM distinct_elec_c AS e
JOIN distinct_gas_c AS g
ON e.anon_id = g.anon_id;
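The difference between an inner join and a left outer join can be tried out without a Hadoop cluster. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive, with two tiny hypothetical tables named animals and animal_eats; the row for Mouse is invented so that one animal has no match. SQLite's join syntax for these two cases is the same as the HiveQL shown above, but this is an analogy and not Hive itself.

```python
import sqlite3


def run_joins():
    """Return (inner_join_rows, left_outer_join_rows) for two small tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE animals (id INTEGER, name TEXT);
        CREATE TABLE animal_eats (id INTEGER, eats TEXT);
        INSERT INTO animals VALUES (1, 'Elephant'), (2, 'Mouse'), (3, 'Cat');
        INSERT INTO animal_eats VALUES (1, 'Hay'), (3, 'Fish'), (6, 'Goldfish food');
    """)
    # Inner join: only ids present in both tables survive
    inner = conn.execute(
        "SELECT a.name, e.eats FROM animals a "
        "JOIN animal_eats e ON a.id = e.id ORDER BY a.id"
    ).fetchall()
    # Left outer join: every animal is kept, unmatched ones get NULL (None)
    left = conn.execute(
        "SELECT a.name, e.eats FROM animals a "
        "LEFT OUTER JOIN animal_eats e ON a.id = e.id ORDER BY a.id"
    ).fetchall()
    conn.close()
    return inner, left


if __name__ == "__main__":
    inner_rows, left_rows = run_joins()
    print(inner_rows)
    print(left_rows)
```

The unmatched animal appears only in the left outer join result, paired with None, which is how Python's sqlite3 module reports a NULL.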


Q.13. Define user defined functions in Hive.

Ans. There are multiple ways to extend Hive's functionality including
writing custom user defined functions (UDF).

(i) Simple Functions — concat can be used to add strings together

SELECT anonid,
acorn_category,
acorn_group,
acorn_type,
concat(acorn_category, ',', acorn_group, ',', acorn_type) AS acorn_code
FROM geog_all;

substr can be used to extract a part of a string

SELECT anon_id,
advancedatetime,
substr(advancedatetime, 1, 2) AS day,
substr(advancedatetime, 3, 3) AS month,
substr(advancedatetime, 6, 2) AS year
FROM elec_c;

length and instr can be used to measure a string and to find a position in it

SELECT anonid,
acorn_code,
length(acorn_code),
instr(acorn_code, ',') AS a_catpos,
instr(reverse(acorn_code), ',') AS reverse_a_typepos
FROM geog_all;

Where needed, functions can be nested.

Cast and type conversions

(ii) Aggregations — Aggregate functions are used to perform some kind
of mathematical or statistical calculation across a group of rows. The rows in
each group are determined by the different values in a specified column or columns.
A list of all of the available functions is given in the Apache documentation.

SELECT anon_id,
count(eleckwh) AS total_row_count,
sum(eleckwh) AS total_period_usage,
min(eleckwh) AS min_period_usage,
avg(eleckwh) AS avg_period_usage,
max(eleckwh) AS max_period_usage
FROM elec_c
GROUP BY anon_id;

In the above example, the aggregations were performed over the single
column anon_id. It is possible to aggregate over multiple columns by specifying
them in both the select and the group by clause. The grouping will take place
based on the order of the columns listed in the group by clause. What is not
allowed is specifying a non-aggregated column in the select clause which is
not mentioned in the group by clause.

SELECT anon_id,
substr(advancedatetime, 6, 2) AS reading_year,
count(eleckwh) AS total_row_count,
sum(eleckwh) AS total_period_usage,
min(eleckwh) AS min_period_usage,
avg(eleckwh) AS avg_period_usage,
max(eleckwh) AS max_period_usage
FROM elec_c
GROUP BY anon_id, substr(advancedatetime, 6, 2);

The group by clause will not accept aliases, but the order by clause does.

SELECT anon_id,
substr(advancedatetime, 6, 2) AS reading_year,
count(eleckwh) AS total_row_count,
sum(eleckwh) AS total_period_usage,
min(eleckwh) AS min_period_usage,
avg(eleckwh) AS avg_period_usage,
max(eleckwh) AS max_period_usage
FROM elec_c
GROUP BY anon_id, substr(advancedatetime, 6, 2)
ORDER BY anon_id, reading_year;

The distinct keyword provides a set of unique combinations of column
values within a table without any kind of aggregation.

SELECT DISTINCT eprofileclass, fueltypes
FROM geog_all;

(iii) Date Functions — In the elec_c and gas_c tables, the advancedatetime
column, although it contains timestamp type information, is defined
as a string type. For much of the time this can be quite convenient, however
there will be times when we really do need to be able to treat the column as a
Timestamp. Perhaps the most obvious example is when you need to sort rows
based on the advancedatetime column.


Hive provides a variety of date related functions to allow you to convert
strings into Timestamp and to additionally extract parts of the Timestamp.

unix_timestamp returns the current date and time as an integer.

from_unixtime takes an integer and converts it into a recognisable
Timestamp string

SELECT unix_timestamp() AS currenttime
FROM sample_07
LIMIT 1;

SELECT from_unixtime(unix_timestamp()) AS currenttime
FROM sample_07
LIMIT 1;

There are various date part functions which will extract the relevant parts
from a Timestamp string

SELECT from_unixtime(UNIX_TIMESTAMP(reading_date, 'ddMMyy')) AS proper_date,
year(from_unixtime(UNIX_TIMESTAMP(reading_date, 'ddMMyy'))) AS full_year,
month(from_unixtime(UNIX_TIMESTAMP(reading_date, 'ddMMyy'))) AS full_month,
day(from_unixtime(UNIX_TIMESTAMP(reading_date, 'ddMMyy'))) AS full_day,
last_day(from_unixtime(UNIX_TIMESTAMP(reading_date, 'ddMMyy'))) AS last_day_of_month,
date_add((from_unixtime(UNIX_TIMESTAMP(reading_date, 'ddMMyy'))), 10) AS added_days
FROM elec_days_c
ORDER BY proper_date;

INTRODUCTION TO PIG, ANATOMY OF PIG, PIG ON
HADOOP, USE CASE FOR PIG, ETL PROCESSING

Q.14. What is Pig ?
Ans. Apache Pig is a platform for analyzing large data sets that consists
of a high-level language for expressing data analysis programs, coupled with
infrastructure for evaluating these programs. The salient property of Pig
programs is that their structure is amenable to substantial parallelization. It is an
application that creates map-reduce jobs based on a language called Pig Latin,
which is workflow driven. It was originally created at Yahoo! (company).
Apache Pig is good for structured data too, but its advantage is the ability to
work with BAGs of data (all rows that are grouped on a key). It is simpler to
implement things like —
(i) Get top N elements for each group
(ii) Calculate total per each group and then put that total into
each row in the group
(iii) Use Bloom filters for JOIN optimisations
(iv) Multiquery support (it is when PIG tries to minimise the number
of MapReduce jobs by doing more stuff in a single job).

Streaming also makes it easy for researchers to take a Python script they
have already debugged on a small data set and run it on a huge data set. PIG is
best for semi structured data, is a procedural language for programming, does not
support partitions, and does not have a dedicated metadata database.

Q.15. Explain the architecture of Apache Pig and its components.

Ans. The language used to analyze data in Hadoop using Pig is known as
Pig Latin. It is a high-level data processing language which provides a rich set
of data types and operators to perform various operations on the data.

To perform a particular task, programmers using Pig need to write a Pig
script using the Pig Latin language, and execute it using any of the execution
mechanisms (Grunt Shell, UDFs, Embedded). After execution, these scripts
will go through a series of transformations applied by the Pig Framework,
to produce the desired output.

Internally, Apache Pig converts these scripts into a series of MapReduce
jobs, and thus, it makes the programmer's job easy. The architecture of Apache
Pig is shown in fig. 3.3.

[Figure labels: Apache Pig; Grunt Shell; Pig Server]

Fig. 3.3 Apache Pig Architecture

As shown in the fig. 3.3, there are various components in the Apache Pig
framework. Let us take a look at the major components.

(i) Parser — Initially the Pig Scripts are handled by the Parser. It
checks the syntax of the script, does type checking, and other miscellaneous
checks. The output of the parser will be a DAG (directed acyclic graph),
which represents the Pig Latin statements and logical operators.
In the DAG, the logical operators of the script are represented as the
nodes and the data flows are represented as edges.


(ii) Optimizer — The logical plan (DAG) is passed to the logical
optimizer, which carries out the logical optimizations such as projection and
pushdown.

(iii) Compiler — The compiler compiles the optimized logical plan
into a series of MapReduce jobs.

(iv) Execution Engine — Finally the MapReduce jobs are submitted
to Hadoop in a sorted order. Finally, these MapReduce jobs are executed on
Hadoop producing the desired results.

(v) Grunt Shell — Grunt is Pig's interactive shell. It enables users to
enter Pig Latin interactively and provides a shell for users to interact with
HDFS. In other words, Grunt is a command interpreter. Users can type Pig Latin
on the Grunt command line and Grunt will execute the command.

Q.16. Why we need Apache Pig ? Explain.

Ans. Programmers who are not so good at Java normally used to struggle
working with Hadoop, especially while performing any MapReduce tasks.
Apache Pig is a boon for all such programmers —

(i) Using Pig Latin, programmers can perform MapReduce tasks
easily without having to type complex codes in Java.

(ii) Apache Pig uses multi-query approach, thereby reducing the
length of codes. For example, an operation that would require to type 200
lines of code (LoC) in Java can be easily done by typing as less as just 10 LoC
in Apache Pig. Ultimately Apache Pig reduces the development time by almost
16 times.

(iii) Pig Latin is an SQL-like language and it is easy to learn Apache Pig
when we are familiar with SQL.

(iv) Apache Pig provides many built-in operators to support data
operations like joins, filters, ordering, etc. In addition, it also provides nested
data types such as tuples, bags, and maps that are missing from MapReduce.

Q.17. What are the applications and features of Apache Pig ?

Ans. Applications of Apache Pig — Apache Pig is generally used by data
scientists for performing tasks involving ad-hoc processing and quick
prototyping. Apache Pig is used as —

(i) To process huge data sources such as web logs.
(ii) To perform data processing for search platforms.
(iii) To process time sensitive data loads.

Features of Apache Pig — Apache Pig comes with the following features —

(i) Rich Set of Operators — It provides many operators to perform
operations like join, sort, filter, etc.
(ii) Ease of Programming — Pig Latin is similar to SQL and it is
easy to write a Pig script if we are good at SQL.
(iii) Optimization Opportunities — The tasks in Apache Pig optimize
their execution automatically, so the programmers need to focus only on
semantics of the language.
(iv) Extensibility — Using the existing operators, users can develop
their own functions to read, process, and write data.
(v) UDFs — Pig provides the facility to create user-defined functions
in other programming languages such as Java and invoke or embed them in
Pig Scripts.
(vi) Handles all Kinds of Data — Apache Pig analyzes all kinds of
data, both structured as well as unstructured. It stores the results in HDFS.

Q.18. What are the major differences between Apache Pig and MapReduce ?

Ans. The major differences between Apache Pig and MapReduce are as
follows —

Apache Pig | MapReduce
Apache Pig is a data flow language. | MapReduce is a data processing paradigm.
It is a high level language. | MapReduce is low level and rigid.
Performing a join operation in Apache Pig is pretty simple. | It is quite difficult in MapReduce to perform a join operation between datasets.
Any novice programmer with a basic knowledge of SQL can work conveniently with Apache Pig. | Exposure to Java is must to work with MapReduce.
Apache Pig uses multi-query approach, thereby reducing the length of the codes to a great extent. | MapReduce will require almost 20 times more the number of lines to perform the same task.
There is no need for compilation. On execution, every Apache Pig operator is converted internally into a MapReduce job. | MapReduce jobs have a long compilation process.

Q.19. What do you mean by Pig Latin ?

Ans. Pig is a high-level platform for creating MapReduce programs used
with Hadoop. The language for this platform is called Pig Latin. Pig Latin
abstracts the programming from the Java MapReduce idiom into a notation
which makes MapReduce programming high level, similar to that of SQL for
RDBMS systems. Pig Latin can be extended using UDF (User Defined Functions)
which the user can write in Java, Python, Javascript, Ruby or Groovy and
then call directly from the language.

Q.20. What are the advantages of Pig as compared to SQL ?

Ans. In comparison to SQL, Pig has the following advantages —
(i) Uses lazy evaluation
(ii) Uses extract, transform, load (ETL)
(iii) Able to store data at any point during a pipeline
(iv) Declares execution plans
(v) Supports pipeline splits, thus allowing workflows to proceed along
DAGs instead of strictly sequential pipelines.

Q.21. What are the major differences between Pig and SQL ?

Ans. The major differences between Apache Pig and SQL are as follows —

Pig | SQL
Pig Latin is a procedural language. | SQL is a declarative language.
In Apache Pig, schema is optional. We can store data without designing a schema (values are stored as $01, $02, etc.) | Schema is mandatory in SQL.
The data model in Apache Pig is nested relational. | The data model used in SQL is flat relational.
Apache Pig provides limited opportunity for query optimization. | There is more opportunity for query optimization in SQL.

Q.22. What are the important uses of Pig.

Ans. Some of the important uses of Pig are as follows —

(i) Pig is a powerful tool for querying data in a Hadoop cluster. It is
so powerful that Yahoo estimates that between 40% and 60% of its Hadoop
workloads are generated from Pig Latin scripts.

(ii) Pig is also used at Twitter (processing logs, mining tweet data),
at AOL and MapQuest (for analytics and batch data processing) and at
LinkedIn, where Pig is used to discover people you might know.

(iii) With a continually increasing population, crime is a huge issue
for governments that have to maintain law and order. The benefit of using
Pig for such analysis is that fewer lines of code have to be written, which
reduces overall development and testing time.

(iv) Pig scripts are used in large scale data processing systems to
process web data efficiently through MapReduce programming in Hadoop.

(v) Pig has been used to evaluate the performance of a commercial
RDBMS against Hadoop in simulation analysis tasks.

What do you mean by ETL big data with Apache Hadoop ? Also explain the
various stages in ETL.

Ans. The challenge of extracting value from big data is similar in many
ways to the age-old problem of distilling business intelligence from transactional
data. At the heart of this challenge is the process used to extract data from
multiple sources, transform it to fit your analytical needs, and load it into a
data warehouse for subsequent analysis, a process known as "Extract,
Transform & Load" (ETL). The nature of big data requires that the
infrastructure for this process can scale cost-effectively.

[Figure labels: Mining; Reporting]

Fig. 3.4 ETL Process

The traditional ETL process extracts data from multiple sources, then
cleanses, formats, and loads it into a data warehouse for analysis. When the
source data sets are large, fast, and unstructured, traditional ETL can become
the bottleneck, because it is too complex to develop, too expensive to operate,
and takes too long to execute.

By most accounts, 80 percent of the development effort in a big data
project goes into data integration and only 20 percent goes toward data
analysis. Furthermore, a traditional EDW platform can cost upwards of
USD 60 K per terabyte. Analyzing one petabyte, the amount of data Google
processes in one hour, would cost USD 60 M.
Strategy that any CIO can affo 27 offload ETL with 9 03 
Vari » is, plait «in Hadoop (on a adoop-com, айы, 
arious Stages in ETL _ “ч, бо the дел еі ETE tasks of cleansing, norma ing раме 
% tradi loying the massive салы, 5^ e ning, and 
Table 3. Ma re for EDW by employ lability or MapReduce. 


М 


Transformation 


p ge volume T 
loaded in a short period and should be wa 


Q.24. Explain function of ETL tools in Apache Hadoop.

Ans. ETL tools move data from one place to another by performing three
functions -

(i) Extract Data from sources such as ERP or CRM Applications -
During the extract step, you may need to collect data from several source
systems and in multiple file formats, such as flat files with delimiters
(CSV) and XML files. You may also need to collect data from legacy systems
that store data in arcane formats no one else uses anymore. This sounds
easy, but can in fact be one of the main obstacles in getting an ETL
solution off the ground.

(ii) Transform that Data into a common format that fits other Data in
the Warehouse - The transform step may include multiple data manipulations,
such as moving, splitting, translating, merging, sorting, pivoting, and
more. For example, a customer name might be split into first and last
names, or dates might be changed to the standard ISO format (e.g., from
07-24-13 to 2013-07-24). Often this step also involves validating the data
against data quality rules.

(iii) Load the Data into the Warehouse for analysis - This step can be
done in batch processes or row by row, more or less in real time.

Fig. 3.5 Off-loading ETL with Hadoop

Using Hadoop allows us to avoid the transformation bottleneck in
traditional ETLT by off-loading the ingestion, transformation, and
integration of unstructured data into the data warehouse, as shown in
fig. 3.5. Because Hadoop enables us to embrace more data types than ever
before, it enriches the data warehouse in ways that would otherwise be
infeasible or prohibitive. Because of its scalable performance, the ETLT
jobs can be accelerated significantly. Moreover, because data stored in
Hadoop can persist over a much longer duration, we can provide more
granular, detailed data through the EDW for high-fidelity analysis.

Using Hadoop in this way, the organization gains an additional ability to
store and access data that they "might" need, data that may never be loaded
into the data warehouse. For example, data scientists might want to use the
large amounts of source data from social media, web logs, and third-party
stores (from curators, such as data.gov) stored on Hadoop to enhance new
analytic models that drive research and discovery. They can store this data
cost effectively in Hadoop, and retrieve it as needed (using Hive or other
analytic tools native to the platform), without affecting the EDW
environment.

Regardless of whether ETL, ELT or ETLT is used for data warehousing, the
operational cost of the overall BI/DW solution can be lowered by
off-loading common transformation pipelines to Apache Hadoop and using
MapReduce on HDFS to provide a scalable, fault-tolerant platform for
processing large amounts of heterogeneous data.
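The three ETL functions described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CSV content, the field names and the in-memory list standing in for the warehouse are hypothetical, and a real pipeline on Hadoop would use tools such as Pig, Hive or MapReduce instead.

```python
import csv
import io
from datetime import datetime

# Hypothetical source data: a delimited flat file with US-style dates.
RAW_CSV = "name,signup\nRaj Verma,07-24-13\nRita Singh,12-01-12\n"


def extract(text):
    """Extract: read rows from a CSV flat file."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: split names, convert dates to ISO format, validate."""
    out = []
    for row in rows:
        first, _, last = row["name"].partition(" ")
        if not first or not last:
            continue  # data quality rule: both name parts are required
        iso = datetime.strptime(row["signup"], "%m-%d-%y").date().isoformat()
        out.append({"first_name": first, "last_name": last, "signup": iso})
    return out


def load(rows, warehouse):
    """Load: append the cleaned rows to the target table (batch load)."""
    warehouse.extend(rows)
    return len(rows)


warehouse = []  # stand-in for a warehouse table
loaded = load(transform(extract(RAW_CSV)), warehouse)
print(loaded, warehouse[0])
```

The date conversion mirrors the example in the text (07-24-13 becomes 2013-07-24), and the name split mirrors the customer name example.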
DATA TYPES IN PIG, RUNNING PIG, EXECUTION MODEL OF PIG,
OPERATORS, FUNCTIONS, DATA TYPES OF PIG
Q.26. Describe user defined data types in Pig.


Ans. Pig's data types can be divided into two categories -
(i) Scalar types
(ii) Complex types.
(i) Scalar Types - It contains a single value. Pig's scalar types are
simple data types that appear in most programming languages. These include -
(a) int - An integer. It stores a four-byte signed integer.
(b) long - A long integer. It stores an eight-byte signed integer.
(c) float - A floating-point number. It uses four bytes to store its
value.
(d) double - A double-precision floating-point number. It uses eight
bytes to store its value.
(e) chararray - A string or character array. Constant chararrays are
expressed as string literals with single quotes.
(f) bytearray - A blob or array of bytes.

(ii) Complex Types - Pig has three complex data types such as maps,
tuples, and bags. All of these types can contain data of any type,
including other complex types. So it is possible to have a map where the
value field is a bag, which contains a tuple where one of the fields is a
map.
(a) Map - A map in Pig is a chararray to data element mapping, where
that element can be any Pig type, including a complex type. The chararray
is called a key and is used as an index to find the element, referred to as
the value.
(b) Tuple - A tuple is a fixed-length, ordered collection of Pig data
elements. Tuples are divided into fields, with each field containing one
data element. These elements can be of any type; they do not all need to be
the same type. A tuple is analogous to a row in SQL, with the fields being
SQL columns.

Syntax - (field [, field ...])

(c) Bag - A bag is an unordered collection of tuples. Because it has no
order, it is not possible to reference tuples in a bag by position. Like
tuples, a bag can, but is not required to, have a schema associated with
it. In the case of a bag, the schema describes all tuples within the bag.

Syntax - {tuple [, tuple ...]}

(d) Nulls - Pig includes the concept of a data element being null. Data
of any type can be null. It is important to understand that in Pig the
concept of null is the same as in SQL, which is completely different from
the concept of null in C, Java, Python, etc. In Pig a null data element
means the value is unknown.

Syntax -

(e) Casts - A cast converts content of one type to another type.

The scalar and complex data types are listed in the following table -

Table 3.11 Data Types in Pig

Type | Description | Example
Scalar Types -
int | Signed 32-bit integer | 10
long | Signed 64-bit integer | Data: 10L or 10l. Display: 10L
float | 32-bit floating point | Data: 10.5F or 10.5f or 10.5e2f or 10.5E2F. Display: 10.5F or 1050.0F
double | 64-bit floating point | Data: 10.5 or 10.5e2. Display: 10.5 or 1050.0
chararray | Character array (string) in Unicode UTF-8 format |
bytearray | Byte array (blob) |
boolean | boolean |
Complex Types -
tuple | An ordered set of fields | (19, 2)
bag | A collection of tuples | {(19, 2), ...}
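The data model above can be imitated with ordinary Python values. The sketch below is only an analogy, not Pig itself: a Python tuple stands in for a Pig tuple, a list of tuples for a bag (Python lists are ordered, whereas bags are not), and a dict with string keys for a map. The field names and the helper functions `null_equals` and `fits_int` are made up for illustration.

```python
# Rough Python stand-ins for Pig's complex types (an analogy, not Pig itself):
#   tuple -> Python tuple, bag -> list of tuples, map -> dict with str keys.
visit = ("home.html", 3)                      # tuple: ordered fields, mixed types
visits = [visit, ("cart.html", 1)]            # bag: collection of tuples
profile = {"name": "Raj", "visits": visits}   # map: chararray key -> any value

# A map whose value is a bag, which contains a tuple with a map field.
nested = {"sessions": [("s1", {"browser": "firefox"})]}


def null_equals(a, b):
    """SQL-style comparison: any comparison involving null is unknown (None)."""
    if a is None or b is None:
        return None
    return a == b


INT_MIN, INT_MAX = -2**31, 2**31 - 1          # range of a four-byte signed int


def fits_int(value):
    """True if value fits a four-byte int; otherwise a long would be needed."""
    return INT_MIN <= value <= INT_MAX


print(nested["sessions"][0][1]["browser"], null_equals(None, None), fits_int(2**31))
```

The `null_equals` helper shows why null in Pig differs from null in C or Java: comparing two unknown values does not yield true, it yields another unknown.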


Q.27. Explain in detail installation and running of Pig.
Ans. Pig Installation - Linux users need the following -
(i) Hadoop 0.20.2
(ii) Java 1.6 (set JAVA_HOME to the root of Java installation)
(iii) Ant 1.7 (optional, for builds)
(iv) JUnit 4.5 (optional, for unit tests).

To get a Pig distribution, download a recent stable release from one of
the Apache Download Mirrors.

Unpack the downloaded Pig distribution. The Pig script is located in the
bin directory.

Add /pig-n.n.n/bin to your path. Use export (bash, sh, ksh) or setenv
(tcsh, csh). For example -
urp@localhost$ export PATH=/usr/local/pig-0.14.0/bin:$PATH
Try the following command, to get a list of Pig commands -
urp@localhost$ pig -help
Try the following command, to start the Grunt shell -
urp@localhost$ pig

Run Modes - Pig has two run or execution modes: local mode and
MapReduce mode.
(a) Local Mode - In this mode, Pig runs in a single JVM and makes use
of the local file system. This mode is suitable only for analysis of small
data sets using Pig.
(b) MapReduce Mode - In this mode, queries written in Pig Latin are
translated into MapReduce jobs and are run on a Hadoop cluster. MapReduce
mode with a fully distributed cluster is useful for running Pig on large
data sets.

Grunt - Grunt is a command interpreter. We can type Pig Latin on the
command line and Grunt will execute the command.

urp@localhost$ pig -x mapreduce

The Grunt shell is invoked and we can enter commands. The results are
displayed to the terminal screen (if DUMP is used) or to a file (if STORE
is used).

Example for Pig -

grunt> a = LOAD '/piginput/pigexample.txt' USING PigStorage() AS (word:chararray);
grunt> words = FOREACH a GENERATE FLATTEN(TOKENIZE(word));
grunt> dump words;
grunt> grouped = GROUP words BY $0;
grunt> dump grouped;
grunt> word_counts = FOREACH grouped GENERATE group, COUNT(words);
grunt> store word_counts into '/pigoutput/word_count1' USING PigStorage();
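To see what the word count script computes, the same steps can be written in plain Python. This is a sketch for comparison only, with a hypothetical two-line input; it splits on whitespace, which is simpler than Pig's TOKENIZE, and it runs on one machine rather than as MapReduce jobs.

```python
from collections import Counter


def word_count(lines):
    """Mirror of the Pig script: TOKENIZE + FLATTEN, GROUP, then COUNT."""
    # Tokenize each line and flatten the result into one list of words.
    words = [w for line in lines for w in line.split()]
    # Group identical words and count the members of each group.
    return Counter(words)


sample = ["big data with pig", "pig latin on big data"]  # hypothetical input
counts = word_count(sample)
print(sorted(counts.items()))
```

Each Pig statement corresponds to one step here: LOAD supplies `lines`, FOREACH with FLATTEN(TOKENIZE(...)) builds `words`, and GROUP followed by COUNT is what `Counter` does.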
Q.28. Explain data processing operators in Pig Latin with examples.


Ans. Pig Latin has a very rich syntax. It supports operators for the
following operations -

(i) Loading and storing of data
(ii) Streaming data
(iii) Filtering data
(iv) Grouping and joining data
(v) Sorting data
(vi) Combining and splitting data.

Pig Latin also supports a wide variety of types, expressions, functions,
diagnostic operators, macros, and file system commands.

(i) DUMP - Dump directs the output of your script to your screen.
Syntax - DUMP alias;

(ii) LOAD - Loads data from the file system.

Syntax - LOAD 'data' [USING function] [AS schema];

'data' is the name of the file or directory, in single quotes. USING and
AS are keywords. If the USING clause is omitted, the default load function
PigStorage is used. Schema - A schema is given using the AS keyword,
enclosed in parentheses.

Usage - Use the LOAD operator to load data from the file system.

For Example - Suppose we have a data file called myfile.txt. The fields
are tab-delimited. The records are newline-separated.
1 2 3
4 2 1
8 3 4

In this example the default load function, PigStorage, loads data from
myfile.txt to form relation A. The two LOAD statements are equivalent. Note
that, because no schema is specified, the fields are not named and all
fields default to type bytearray.

A = LOAD 'myfile.txt';
A = LOAD 'myfile.txt' USING PigStorage('\t');
DUMP A;
Output -
(1, 2, 3)
(4, 2, 1)
(8, 3, 4)

Sample Code - The examples are based on these Pig commands, which
extract all user IDs from the /etc/passwd file.

A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
dump B;
store B into 'id.txt';

(iii) Storing Data - Stores or saves results to the file system.

Syntax - STORE alias INTO 'directory' [USING function];

Example - In this example data is stored using PigStorage and the
asterisk character (*) as the field delimiter.

STORE A INTO 'myoutput' USING PigStorage('*');
CAT myoutput;

(iv) Streaming Data - Sends data to an external script or program.

(v) Grouping and Joining Data -

(a) GROUP - Groups the data in one or more relations.
(b) JOIN (Inner) - Performs an inner join of two or more relations
based on common field values.
(c) JOIN (Outer) - Performs an outer join of two relations based on
common field values.

Example - Suppose we have relations A and B.
A = LOAD 'data1' AS (a1 : int, a2 : int, a3 : int);
DUMP A;
(1, 2, 3)
(4, 2, 1)
(8, 3, 4)
(4, 3, 3)
(7, 2, 5)
(8, 4, 3)
B = LOAD 'data2' AS (b1 : int, b2 : int);
DUMP B;
(2, 4)
(8, 9)
(1, 3)
(2, 7)
(2, 9)
(4, 6)
(4, 9)

X = JOIN A BY a1, B BY b1;
DUMP X;
(1, 2, 3, 1, 3)
(4, 2, 1, 4, 6)
(4, 3, 3, 4, 6)
(4, 2, 1, 4, 9)
(4, 3, 3, 4, 9)
(8, 3, 4, 8, 9)
(8, 4, 3, 8, 9)
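The join result above can be checked by hand with a small Python sketch. It uses the same rows as relations A and B and a simple hash join; the function name `inner_join` and its parameters are illustrative choices, and Pig itself may pick a different join strategy and output order.

```python
from collections import defaultdict

A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]
B = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]


def inner_join(left, right, left_key=0, right_key=0):
    """Hash join: index the right relation by key, then probe with the left."""
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    # A left row with no match (such as (7, 2, 5)) produces no output rows.
    return [l + r for l in left for r in index.get(l[left_key], [])]


X = inner_join(A, B)
for row in sorted(X):
    print(row)
```

Rows of B whose first field is 2 have no partner in A, and the A row starting with 7 has no partner in B, so neither appears in X. An outer join would keep such rows and pad the missing side with nulls.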


Q.29. What are the advantages and disadvantages of Pig ?


Ans. The advantages and disadvantages of Pig are as follows -


Advantages -

(i) It decreases the duplication of data.
(ii) It reduces the number of lines of code and saves development time.
(iii) User defined functions can be easily programmed for read/write
operations.
(iv) It supports nested data models.
(v) A programmer who knows the SQL language can easily learn to write
Pig scripts.

Disadvantages -

(i) It does not provide JDBC and ODBC connectivity.
(ii) There is no dedicated metadata database.

Q.30. Explain the functions used in Pig.
Ans. The functions used in Pig are as follows -

(i) Eval - A function that takes one or more expressions and returns
another expression.
(ii) Filter - A special type of eval function that returns a logical
Boolean result.
(iii) Load - A function that specifies how to load data into a relation
from external storage.
(iv) Store - A function which specifies how to save contents of a
relation to external storage.
Q.31. What types of commands are used in Pig program ?
Ans. The commands used in Pig program are as follows -


Table 3.12 Pig Commands

Command | Function
Load | Read data from the file system
Store | Write data to the file system
Dump | Write output to stdout
Foreach ... Generate | Apply expression to each record and generate one or more records
Filter | Apply predicate to each record and remove records where false
Group | Collect records with the same key from one or more inputs
Join | Join two or more inputs based on a key
Order | Sort records based on a key
Distinct | Remove duplicate records
Union | Merge two datasets
Limit | Limit the number of records
Split | Split data into two or more sets, based on filter conditions
Cross | Creates the cross product of two or more relations
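Several of the commands in the table have direct counterparts in ordinary list processing. The following Python sketch uses a hypothetical list of (name, age) records to show what Filter, Distinct, Order, Limit, Split and Cross each do to a relation; it illustrates the semantics only and is not how Pig executes them.

```python
records = [("raj", 30), ("rita", 25), ("jay", 35), ("rita", 25)]  # hypothetical

# Filter: keep only the records for which the predicate is true.
filtered = [r for r in records if r[1] >= 30]

# Distinct: remove duplicate records (first occurrence is kept).
distinct = list(dict.fromkeys(records))

# Order: sort records based on a key, here the age field.
ordered = sorted(distinct, key=lambda r: r[1])

# Limit: cap the number of records.
limited = ordered[:2]

# Split: divide the data into two sets based on a condition.
young = [r for r in distinct if r[1] < 30]
senior = [r for r in distinct if r[1] >= 30]

# Cross: every combination of a record from each input.
cross = [(a, b) for a in ("x", "y") for b in (1, 2)]

print(filtered, limited, len(cross))
```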
Q.1. What is NoSQL ? Explain.

Ans. NoSQL, which means "Not only SQL", is a generic term for database
management systems (DBMS) which provide a mechanism for storing and
retrieving data different from that of relational DBMS, and hence
traditional SQL queries over the data cannot be applied to them. A basic
feature of most NoSQL datastores is "shared nothing" horizontal scaling,
which allows them to execute a huge number of read/write operations per
second. Non-relational databases are generally known for their schema-less
data models, improved performance and scalability.

A conventional relational database system uses two-dimensional tables for
data creation, with properties like transactions, complex SQL queries, and
multi-table related queries. However, multi-table queries are not effective
for huge data queries. Scalability in relational databases requires powerful
servers that are both expensive and difficult to handle.

NoSQL provides the flexibility to store entire data in terms of documents
instead of the conventional method of table-row-column. NoSQL is useful
when we need to access and analyze huge amounts of unstructured data or
data that is stored remotely on multiple virtual servers. There are
different NoSQL databases -

(i) Key-value Stores - It is a system that stores values indexed for
retrieval by keys. These systems can hold structured or unstructured data
and can easily be distributed to a cluster or a collection of nodes, as in
Amazon DynamoDB and Project Voldemort.

(ii) Column-oriented Databases - It is a system which stores data in
whole columns instead of rows, which minimizes disk access compared to a
heavily structured table of columns and rows with uniform sized fields for
each record, as in HBase and Cassandra.

(iii) Document-based Stores - These databases store data and organize
them as document collections, instead of structured tables with uniform
sized fields for each record. With this database, users can add any number
of fields of any length to a document, as implemented in CouchDB.

(iv) Graph Databases - These databases use nodes, edges and properties
to represent and store data in the form of graphs. They represent data as a
graph-like structure, which can be easily traversed, as in AllegroGraph and
Neo4j.


Q.2. Write down the features of NoSQL.


Ans. NoSQL databases may not require a predefined table schema, typically
scale horizontally and usually avoid join operations. Their main features
are as follows -

(i) Scale-out - This refers to achieving high performance in a
distributed environment by using many general-purpose machines. NoSQL
databases allow the distribution of the data over a large number of machines
with a distributed processing load. Many NoSQL databases allow automatic
distribution of data to new machines when they are added to the cluster.
Scale-out is evaluated in terms of scalability and elasticity.

(ii) Flexibility - Flexibility in terms of data structure says that there
is no need to define a schema for databases. NoSQL databases do not require
a predefined schema. This allows the users to store data of various
structures in the same database table. However, high-level query languages
such as SQL are not supported by most of the NoSQL databases.

(iii) Data Replication - One of the features of NoSQL databases is
data replication. In this process a copy of the data is distributed to
different systems in order to achieve redundancy and load distribution.
However, there is a chance of losing data consistency among the replicas.
But it is believed that this consistency may be achieved eventually.
Consistency and availability are the factors for evaluating replication.
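Scale-out and replication can be pictured with a small placement sketch. The Python below is a simplified illustration, assuming a hypothetical three-node cluster and simple modulo placement; the node names and helper functions are made up, and production NoSQL systems typically use consistent hashing so that adding a machine moves only a fraction of the keys.

```python
import hashlib


def node_for(key, nodes):
    """Pick a node by hashing the key (simple modulo placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def replicas_for(key, nodes, copies=2):
    """The primary node plus the next nodes in list order hold the copies."""
    start = nodes.index(node_for(key, nodes))
    count = min(copies, len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(count)]


nodes = ["node-a", "node-b", "node-c"]  # hypothetical cluster
placement = {k: replicas_for(k, nodes) for k in ("Emp20", "Emp21", "Emp23")}
print(placement)
```

Because every key is stored on two different machines, the loss of one machine does not lose data. The price is that a write must reach both copies, and until it does the replicas can disagree, which is the consistency risk mentioned above.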


Q.3. Write short note on oracle big data.


Ans. Oracle is the first vendor to offer a complete and integrated
solution to address the full spectrum of enterprise big data requirements.
Oracle's big data strategy is centered on the idea that you can extend your
current enterprise information architecture to incorporate big data. New big
data technologies, such as Hadoop and Oracle NoSQL database, run alongside
your Oracle data warehouse to address your big data requirements.


Fig. 4.1 (a) Oracle's Big Data Solutions


Oracle Big Data Appliance - Oracle big data appliance is an engineered
system that combines optimized hardware with a comprehensive big data
software stack to deliver a complete, easy-to-deploy solution for acquiring
and organizing big data.

Oracle big data appliance comes in a full rack configuration with 18 Sun
servers for a total storage capacity of 648 TB. Every server in the rack has
2 CPUs, each with 8 cores, for a total of 288 cores per full rack. Each
server has 64 GB memory for a total of 1152 GB of memory per full rack.

Oracle big data appliance includes a combination of open source software
and specialized software developed by Oracle to address enterprise big data
requirements.


Fig. 4.1 (b) High-level Overview of Software on Big Data Appliance


The Oracle big data appliance software includes -

(i) Full distribution of Cloudera's Distribution including Apache
Hadoop (CDH4).

(ii) Oracle Big Data Appliance Plug-in for Enterprise Manager.

(iii) Cloudera Manager to administer all aspects of Cloudera CDH.

(iv) Oracle distribution of the statistical package R.

(v) Oracle NoSQL Database Community Edition.

(vi) Oracle Enterprise Linux operating system and Oracle Java VM.
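The full rack totals quoted above follow directly from the per-server figures, as this short Python check shows. The storage per server is derived here by division and is not stated in the text.

```python
servers = 18
cpus_per_server, cores_per_cpu = 2, 8
memory_gb_per_server = 64
total_storage_tb = 648

total_cores = servers * cpus_per_server * cores_per_cpu   # 18 * 2 * 8
total_memory_gb = servers * memory_gb_per_server          # 18 * 64
storage_tb_per_server = total_storage_tb / servers        # derived, not stated

print(total_cores, total_memory_gb, storage_tb_per_server)
```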


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Q.4. What is the role of NoSQL business drivers ? 

Ans. Most of the organizations supporting single-CPU relational systems
have come to a crossroads. The needs of these organizations are changing.
Businesses have found value in rapidly capturing and analyzing large amounts
of variable data, and making immediate changes in their businesses based on
the information they receive.

The business drivers are velocity, volume, variability and agility, and
these play an important role in the emergence of NoSQL solutions. As each of
these drivers applies pressure to the single-processor relational model, its
foundation becomes less stable and in time no longer meets the
organization's needs. In short, volume and velocity refer to the ability to
handle large data sets that arrive quickly. Variability refers to how
diverse data types don't fit into structured tables, and agility refers to
how quickly an organization responds to business changes. The NoSQL business
drivers are shown in fig. 4.2.


Fig. 4.2 NoSQL Business Drivers

(i) Velocity - Though big data problems are a consideration for many
organizations moving away from RDBMSs, the ability of a single-processor
system to rapidly read and write data is also key. Many single-processor
RDBMSs are unable to keep up with the demands of real-time inserts and
online queries to the database made by public-facing websites. RDBMSs
frequently index many columns of every new row, a process which decreases
system performance. When single-processor RDBMSs are used as a back end to
a web store front, the random bursts in web traffic slow down response for
everyone, and tuning these systems can be costly when both high read and
write throughput is desired.

(ii) Volume - Without a doubt, the important factor pushing
organizations to find alternatives to their current RDBMSs is a need to
query big data using clusters of commodity processors. Until around 2005,
performance concerns were resolved by purchasing faster processors. In
time, the ability to increase processing speed was no longer an option. As
chip density increased, heat could no longer dissipate fast enough without
chip overheating. This phenomenon, known as the power wall, forced systems
designers to shift their focus from increasing speed on a single chip to
using more processors working together. The need to scale out (also known
as horizontal scaling), rather than scale up (faster processors), moved
organizations from serial to parallel processing, where data problems are
split into separate paths and sent to separate processors to divide and
conquer the work.

(iii) Variability - Companies that want to capture and report on
exception data struggle when attempting to use the rigid database schemas
imposed by RDBMSs. For example, if a business unit wants to capture a few
custom fields for a particular customer, all customer rows within the
database need to store this information even though it doesn't apply.
Adding new columns to an RDBMS requires the system to be shut down and
ALTER TABLE commands to be run. When a database is large, this process can
impact system availability, costing time and money.

(iv) Agility - The most complex part of building applications using
RDBMSs is the process of putting data into and getting data out of the
database. If your data has nested and repeated subgroups of data structures,
you need to include an object-relational mapping layer. The responsibility
of this layer is to generate the correct combination of INSERT, UPDATE,
DELETE, and SELECT SQL statements to move object data to and from the RDBMS
persistence layer. This process is not simple and is associated with the
largest barrier to rapid change when developing new or modifying existing
applications.


Q.5. Write short note on NoSQL data architectural patterns.


Ans. There are four basic types of NoSQL data architectural patterns in
the broad sense: key-value, document oriented, graph oriented and column
oriented. A large number of cloud databases have been developed under each
category. This implies the need to understand the differences among such
data stores, and which is more suitable for any given data. The key-value
model allows representing the data in a simple format. The document oriented
model allows representing data via structured text. In the graph oriented
model, data is stored in the form of a graph. The column oriented model
allows representing data in columns, and data is stored in tables.


Q.6. Explain the key-value data stores with example.

Ans. In order to handle highly concurrent access to a database, the
category of NoSQL designed is key-value stores. It is the simplest, yet the
most powerful, data store. In a key-value store each data item consists of a
pair of a unique key and a value. In order to save data, a key gets
generated by the application and the value gets associated with the key.
This key-value pair then gets submitted to the data store. The data values
stored in key-value stores can have dynamic sets of attributes attached to
them and are opaque to the database management system. Hence the key is the
only means to access the data values. The type of binding from the key to
the value depends on the programming language used in the application. An
application needs to provide a key to the data store in order to retrieve
data. Many key-value data stores use a hash function. The application
hashes the key and finds out the location of the data in the database.
Key-value data stores are row focused, which means they enable the
application to retrieve data for complete entities.

Retrieval of data from a key-value database is shown in fig. 4.3. The
application has specified a key "Emp20" to the data store in order to
retrieve data. Using the hash function, the application hashes the key in
order to trace the location of data in the data store. The design of the key
should support the most frequent queries fired on the data store. Efficiency
of the hash function, design of the key and size of the values being stored
are the factors which affect the performance of a key-value data store. The
operations performed on such data stores are mostly limited to read and
write operations. Because of the simplicity of the key-value data store, it
provides users with the fastest means of storing and fetching data. All
other categories of NoSQL are built upon the simplicity, scalability and
performance of key-value data stores.


Fig. 4.3 A Key-value Data Store


Examples of key-value data stores are Berkeley DB, Redis, Memcache,
Riak and DynamoDB.
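The retrieval path described above (hash the key, locate the bucket, return the opaque value) can be sketched as a toy in-memory store in Python. The class name, the bucket count and the stored value are hypothetical; real key-value stores add persistence, distribution and concurrency control on top of this idea.

```python
import hashlib


class KeyValueStore:
    """Toy key-value store: values are opaque bytes located by hashing the key."""

    def __init__(self, buckets=8):
        self._buckets = [dict() for _ in range(buckets)]

    def _bucket(self, key):
        # Hash the key and map the hash to one of the buckets.
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return self._buckets[int.from_bytes(digest[:4], "big") % len(self._buckets)]

    def put(self, key, value):
        self._bucket(key)[key] = value

    def get(self, key, default=None):
        return self._bucket(key).get(key, default)


store = KeyValueStore()
store.put("Emp20", b'{"FN": "Raj", "LN": "Verma"}')
print(store.get("Emp20"))
```

The store never looks inside the value, so the only supported operations are a write and a read by key. Finding all employees with a given last name would require the application to scan and decode every value itself.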


Q.7. What are the characteristic features of key-value data store ?


Ans. The characteristic features of key-value data stores are as follows -

(i) Key-value data stores provide a single means of accessing data,
by primary key.
(ii) They use in-memory storage to provide fast access, with optional
persistence.
(iii) Other data models are built on top of this model to provide more
complex objects. Document databases give the richest query functionality,
which supports a wide variety of operational and real-time analytics
applications.
(iv) They can be used for applications that require fast changing data,
like mobile, gaming and online ads.
(v) They have no inherent data model.

Q.8. Give some popular use cases for key-value based data stores.
Ans. The typical usages of key-value based data stores are as follows -
(i) Distributing Information - They can be used to implement pub/sub.
(ii) Queueing - Some key-value data stores like Redis support lists,
queues, sets etc.
(iii) Keeping Live Information - Applications which need to keep a
state can use key-value data stores easily.
(iv) Caching - Quickly storing data for sometimes frequent future use.

Q.9. Explain the document oriented data stores with example.


Ans. Document oriented data stores are used to store and organize data in
the form of documents. At an abstract level a document oriented database is
similar to a key-value data store. It also holds values, which an
application can read or fetch by using a key. Several document databases
automatically generate the unique key while creating a new document. A
document in a document database is an entity, which is a collection of named
fields. The feature which distinguishes the document oriented database from
a key-value data store is the transparency of the data held by the database.
Hence the query possibility is not restricted to the key only. An
application that requires querying the database not only based on its key
but also with attribute values can switch to document databases. A document
needs to be self-describing in a document oriented database. Information is
stored in a portable and well understood format such as XML, BSON or JSON.
As shown in fig. 4.4, the document database stores data in the form of
key-value pairs, like key-value databases. But the data stored in the
database is transparent to the system, unlike key-value databases. The
application can query the database not only with the key i.e. 'Employee ID'
but also with the defined fields in the document i.e. firstname (FN),
lastname (LN), age etc. Document data stores are an efficient approach to
model data based on common software problems. But it comes at the cost of
slightly lower performance and scalability in comparison to key-value data
stores. A few examples of the most prominent document stores are Riak,
MongoDB, CouchDB, MarkLogic, eXist-db, Berkeley DB XML, Couchbase etc.


Fig. 4.4 A Document Data Store
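The difference from a key-value store, namely that the fields inside a document are visible to queries, can be shown with a short Python sketch. The documents, the age values and the `find` helper are hypothetical; an actual document database would provide its own query language and indexes.

```python
documents = {   # hypothetical employee documents, keyed by Employee ID
    "Emp20": {"FN": "Raj", "LN": "Verma", "age": 31},
    "Emp21": {"FN": "Jay", "LN": "Sharma", "age": 28},
    "Emp23": {"FN": "Rita", "LN": "Singh"},   # no age field: schema is flexible
}


def find(docs, **criteria):
    """Return the keys of documents whose fields match every criterion."""
    return [key for key, doc in docs.items()
            if all(doc.get(field) == value for field, value in criteria.items())]


by_key = documents["Emp21"]                # lookup by key, as in a key-value store
by_field = find(documents, LN="Sharma")    # lookup by a field inside the document
print(by_key, by_field)
```

The third document omits the age field without any schema change, which is the flexibility the text refers to. A query on a missing field simply does not match that document.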


Q.10. Write down some popular use case for document oriented data Stores, 
Ans. Some popular use cases for document oriented based data Stores 
are as follows — i 
(1) JavaScript Friendly — Опе of the most critical functionalities of 
document based data stores are the way they interface with applications using 
javascript friendly Javascript object notation (JSON). 
(ii) Nested Information — Document based data stores allow you 
to work with deeply nested complex data structures. 


О.Л. Explain graph oriented data stores with example. 


Ans. Graph databases are considered to be the specialists of highly linked 
data. Therefore it handles data involving a huge number of relationships. There 
are basically three core abstractions of graph database. These are nodes, edges 
which connect two different nodes, and properties. Each node holds information 
about an entity. The edges represent the existence of relationship between the 
entities. Each relationship is havinga relationship type and is directional witha 
start point (node) and an end point. The end point can be some other node 
than that of the start node or possibly the same node. Key-value properties am 

associated not only with the nodes but also with the relationships. The properties 
of the relationships provide additional information about the relationships. The 
direction ofthe relationship determines the traversal path from one node to the 


other in a graph database. 


Ї 
Fig. 4.5 represents a part of the ‘Employee’ database structured as pe 


database. Each node in this graph database represents an employee en hip 
Thesc entities are related with each other through a relationship of relations 
type “knows”. The property associated with the relationship i$ "s 
The key difference between a graph and relational database is data query! 
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relational database 
ugh graph database 
€rsal starts from the 


intensive proces ТЕ Tecutsive join as in 
traversal method. While querying thro 


— use E specified by the application. Tray, 
sses Via relationships to nodes c 


le defined by the application logic. The traversal 


[Figure: three employee nodes, Emp 20 (First Name : Raj, Last Name : Verma),
Emp 21 (First Name : Jay, Last Name : Sharma) and Emp 23 (First Name : Rita,
Last Name : Singh), linked by "Knows" relationships carrying a Duration
property of 1 year, 2 years and 2 years.]

Fig. 4.5 A Graph Oriented Database
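
The node, relationship and property abstractions and the traversal method described above can be sketched in a few lines of Python. This is only an illustrative in-memory model, not the API of any graph database. The employee names and durations come from Fig. 4.5, but which employee "knows" which is an assumption, and the helper names `neighbours` and `traverse` are hypothetical.

```python
from collections import deque

# Nodes: entity id -> key-value properties (values taken from Fig. 4.5).
nodes = {
    "Emp20": {"first_name": "Raj", "last_name": "Verma"},
    "Emp21": {"first_name": "Jay", "last_name": "Sharma"},
    "Emp23": {"first_name": "Rita", "last_name": "Singh"},
}

# Directed relationships: (start node, type, end node, properties).
# The direction of each "knows" edge is assumed for illustration only.
edges = [
    ("Emp20", "knows", "Emp21", {"duration_years": 1}),
    ("Emp21", "knows", "Emp23", {"duration_years": 2}),
    ("Emp20", "knows", "Emp23", {"duration_years": 2}),
]


def neighbours(node_id, rel_type):
    """Follow the outgoing relationships of one type from a node."""
    return [(end, props) for start, rtype, end, props in edges
            if start == node_id and rtype == rel_type]


def traverse(start, rel_type, min_years=0):
    """Breadth-first traversal from the start node chosen by the application.

    Only relationships whose duration satisfies the application rule
    (duration_years >= min_years) are followed.
    """
    seen, order, queue = {start}, [], deque([start])
    while queue:
        current = queue.popleft()
        order.append(current)
        for end, props in neighbours(current, rel_type):
            if end not in seen and props["duration_years"] >= min_years:
                seen.add(end)
                queue.append(end)
    return order
```

Calling `traverse("Emp20", "knows")` walks every employee reachable from Emp 20, while passing `min_years=2` restricts the walk to longer acquaintances. No join is needed: the traversal simply follows the stored relationships.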


Q.12. What are the characteristic features of graph oriented data store ?

Ans. The characteristic features of graph oriented data stores are as
follows -
(i) It implements very fast graph traversal operations.
(ii) Graph oriented database also supports indexing of meta data to
enable graph traversal combined with search queries.
(iii) It suits applications that deal with objects with a large number of
inter-relations.
(iv) It models graphs consisting of nodes and edges with properties
describing them.
(v) It suits applications like social networking friends networks,
hierarchical role based permissions, maps and network topologies.

Q.13. Write down important use cases for graph oriented based data store.

Ans. The important use cases for graph oriented based data stores are -
(i) Modelling and Handling Classification - Graph oriented
databases excel in any situation where relationships are involved. Modelling
data and classifying various information in a relational way can be handled
very well using these type of data stores.
(ii) Handling Complex Relational Information - Graph oriented
database makes it extremely efficient and easy to handle complex but
relational information such as the connections between two entities and various
degrees of other entities indirectly related to them.

Q.14. Explain the column oriented database store with example.

Ans. Column oriented datastores are designed to store huge numbers of
columns. Data is stored based on column values. Though these datastores are
the most similar to their traditional relational counterparts, they are able to
overcome the drawbacks of the latter databases, as they remove null values
from columns when values are unknown. They support high scalability since
column data can be distributed on several clusters easily. They are also most
suitable for data mining and analytics applications. Most of these datastores
employ the MapReduce framework to speed up processing of large amounts of
data distributed on numerous clusters.

Sometimes an application may want to read or fetch only a subset of fields,
similar to SQL's projection operation. A column family data store enables
storing data in a column-centric approach. The column family data store
partitions the key space. In NoSQL, a key space is an object
which holds all column families of a design together; it is the outermost
grouping of the data in the data store. Each partition of the key space is
known as a table. Column families are declared by these tables. Each
column family consists of a number of columns. A row in a column family is
structured as a collection of an arbitrary number of columns. Each column is a
map of a key-value pair. In this map, keys are the names of columns and
columns themselves are the values. Each of these mappings is called a cell.
Each row in a column-family database is identified by a unique row key,
defined by the application. Use of these row keys makes data retrieval
quicker. To avoid overwriting cell values, a few of the popular
column-family databases automatically add time stamp information to
individual columns.
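
The row key, column and time-stamped cell structure described above can be sketched as nested maps. The class below is a minimal in-memory illustration under assumed names (`ColumnFamilyTable`, `put`, `get`, `versions`, `project` are all hypothetical); it is not how any particular column-family database is implemented.

```python
import time
from collections import defaultdict


class ColumnFamilyTable:
    """Rows are identified by an application-defined row key.

    Each row holds an arbitrary set of columns, and each column keeps a
    list of (timestamp, value) cells so that new writes do not overwrite
    older values.
    """

    def __init__(self):
        # row key -> column name -> list of (timestamp, value), oldest first
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        cells = self._rows[row_key][column]
        cells.append((ts, value))
        cells.sort(key=lambda cell: cell[0])

    def get(self, row_key, column):
        """Return the most recent value of a cell, or None if absent."""
        cells = self._rows.get(row_key, {}).get(column)
        return cells[-1][1] if cells else None

    def versions(self, row_key, column):
        """Return every time-stamped version of a cell, oldest first."""
        return list(self._rows.get(row_key, {}).get(column, []))

    def project(self, row_key, columns):
        """Fetch only a subset of columns, like an SQL projection."""
        result = {}
        for column in columns:
            value = self.get(row_key, column)
            if value is not None:
                result[column] = value
        return result
```

Because rows are plain maps, two rows in the same table may carry completely different columns, and a column that was never written for a row simply takes no space.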


A simple example data structure of a column datastore is shown in fig.
4.6. It stores information similar to that of the document datastore shown
earlier, but in a different column-oriented format.

Cassandra is one of the column datastores; it is now developed under the
Apache Software Foundation and implemented in Java. It is based on both Amazon's
Dynamo key-value datastore and Google's Bigtable column datastore, so it
includes features of both datastore types. It supports high availability,
partitioning tolerance, persistence and scalability. It also has a dynamic
schema. It is used for a variety of applications like social networking,
banking and finance, and real time analytics. Some other column data stores are
Amazon DynamoDB, Apache Accumulo and Hypertable.

Fig. 4.6 Simple Data Structure of a Column Datastore

Q.15. What are the characteristic features of column oriented data stores ?

Ans. The characteristic features of column oriented data stores are as
follows -
(i) A column can have multiple time stamped versions.
(ii) Extension of key-value model, where the value is a set of columns.
(iii) Storing a large number of time-stamped data like event logs and
sensor data.
(iv) Columns can be generated at run time and not all rows need to have
all columns.
(v) It provides more granular access to data than the key-value
datastore, but less flexibility than the document oriented data stores.

Q.16. Write down some important use cases of column oriented based
data stores.

Ans. Some important use cases of column oriented based data stores are
as follows -
(i) Scaling - These are highly scalable by nature and also handle a
huge amount of information.
(ii) Keeping Unstructured, Non-volatile Information - If a large
collection of attributes and values needs to be kept for long periods of time,
column based data stores come in extremely handy.

Q.17. Explain the comparison of NoSQL data store patterns.

Ans. The comparison of NoSQL database stores is shown in table 4.1.
Table 4.1 Comparison of NoSQL Database Stores

Data Model -
Column-oriented : Data are stored in tables. Values in a column are stored
consecutively.
Document : Each document is identified by a unique identifier. Any value can be
a structured document. Key and value are separated by a colon ":". Key-value
pairs are separated by commas ",". Data enclosed in curly braces denotes a
document. Data enclosed in square brackets denotes an array collection.
Graph : Objects and their connections. Set of links between the objects (edges).

Techniques -
Column-oriented : With compression (lightweight encoding, bit-vector encoding,
dictionary encoding, frame of reference encoding, differential encoding). With
join algorithm.
Document : Denormalized model with more structure (metadata). Shattered,
equivalent to normalization.
Graph : Simple directed graph, undirected multigraph, directed multigraph,
weighted graph, hypergraph.

Applications - Inventory data; JSON documents and XML documents; user profiles
and their attributes.

Advantages - Support for multiple document types; suitable for complex data,
nested documents and arrays; fault tolerance, redundancy, scalability.

Disadvantages - Lack of a standard declarative language; very basic query
language.

Q.18. What is Cassandra ? Write down its advantages and disadvantages.

Ans. Cassandra was initially developed by Facebook to handle large volumes
of data and later it was taken up by the Apache Software Foundation. It
provides no single point of failure, high availability, high scalability,
high fault tolerance and high performance of data. Many servers are
connected to each other and if one node goes down, another node can
provide the service to the end user. To allow all the nodes to communicate with
each other and to detect the faulty node, a gossip protocol is used.

Advantages -
(i) The biggest companies like Facebook, Twitter, Rackspace and Cisco
use Cassandra to handle their data.
(ii) It uses the Dynamo-style replication model to provide no single
point of failure.
(iii) It provides high throughput and quick response time if the
number of nodes in the cluster increases.
(iv) The ACID property is supported for transaction.
(v) It is a column oriented data base.

Disadvantages -
(i) It does not support sub query and join operation.
(ii) Limited support for data aggregation.
(iii) Limited storage space for a single column value.

Q.19. What do you understand by Hbase ? Explain the architecture of Hbase.

Ans. Hbase is the Hadoop database which can provide real-time access to
the data and powerful scalability. Hbase was designed based on Bigtable, a
database launched by Google. Hbase aims at storing and processing big
data easily. More specifically, it uses a general hardware configuration to
process millions of data. Hbase is an open source, distributed database
and uses the NoSQL database model. It can be applied on the local file systems
and on HDFS. In addition, Hbase can use the MapReduce computing framework to
parallel process big data in Hadoop. This is also the core feature of Hbase. It
can combine data storage with parallel computing perfectly.

Architecture of Hbase - Hbase is the storage layer in Hadoop. Its
underlying storage support is HDFS, using the MapReduce framework to
process the data, and it cooperates with ZooKeeper. The architecture of Hbase
is shown in fig. 4.7.
[Figure: HRegionServer blocks, each containing HRegion, Store, MemStore and
StoreFile, placed above HDFS DataNodes.]

Fig. 4.7 Hbase Architecture


The four key components are as follows -

(i) Hbase Client - The client is the user of the Hbase. It takes part in
the management operations with HMaster and read/write operations with
HRegionServer.

(ii) ZooKeeper - ZooKeeper is the collaborative management node
of Hbase. It provides distributed collaboration, distributed synchronization
and configuration functions. ZooKeeper coordinates all the clusters of
Hbase by using data which contains the HMaster address and HRegionServer
status information.

(iii) HMaster - HMaster is the controller of the Hbase. It is
responsible for adding, deleting, and querying the data. It adjusts the
HRegionServer's load balance and the Region distribution to ensure that
Regions are moved to another HRegionServer when an HRegionServer suffers
failure. An Hbase environment can launch multiple HMasters to avoid failure.
At the same time, there is always a Master Election mechanism working in case
of node failure.

(iv) HRegionServer - HRegionServer is the core component of
Hbase. It is responsible for handling the reading and writing requests for the
user and performing the corresponding operations on HDFS.
Q.20. Explain the comparison of RDBMS and Hbase.


Characteristic | Hbase | RDBMS
Data layout | A sparse, distributed, persistent multidimensional sorted map. | Row-oriented or column-oriented.
Data types | Bytes; data types are interpreted on query. | Rich data type support.
Hardware profile | Hadoop-clustered commodity x86 systems. | Typically large, scalable multi-processor systems.
... | ... because the underlying storage technology is HDFS, which by default requires three replicas. | ...
High availability | Yes; built into the Hadoop architecture. | Yes, if the hardware and RDBMS are configured correctly.
Indexes | Row-key only or special table required. | Yes
Query language | Hbase API commands (get, put, scan, delete, increment, check), HiveQL | SQL
Hbase, as a representative NoSQL database, is often compared with the traditional
RDBMS. The design target, implementation mechanism, and running performance
are different. Because Hbase and RDBMS can replace each
other in some special situations, it is inevitable to compare RDBMS with Hbase.
As mentioned before, Hbase is a distributed database system and the underlying
physical storage uses the Hadoop distributed file system. It does not have
particularly strict requirements on the hardware platform. However, RDBMS is
a fixed structure database system. The difference between their design goals
makes them have the greatest difference in the implementation mechanism.


Q.21. Explain in detail about challenges of NoSQL.


Ans. The promise of NoSQL has generated a lot of enthusiasm, but
there are many obstacles to overcome before NoSQL systems can appeal to
mainstream enterprises. Here, the important challenges are as follows -


(i) Maturity - Relational database systems have been around for a
long time and are stable as well as richly functional. NoSQL (Not only SQL)
advocates will argue that their advancing age is a sign of their obsolescence,
but for most CIOs, the maturity of the RDBMS is reassuring. Most NoSQL
alternatives are in preproduction versions with many key features yet to be
implemented. Living on the technological leading edge is an exciting prospect
for many developers, but enterprises should approach it with extreme caution.


(ii) Support - Every organization wants the reassurance that if a
key system fails, they will be able to get timely and competent support.
All relational database management system vendors go to great lengths to
provide a high level of enterprise support. Many NoSQL systems are open source
projects, and although there are usually one or more firms offering support
for each NoSQL database, these companies often are small start-ups without
the global reach, support resources, or credibility of a Microsoft,
Oracle, or IBM.


(iii) Business Intelligence and Analytics - NoSQL databases
have evolved to meet the scaling demands of modern Web 2.0 applications, but
they offer few facilities for ad-hoc query and analysis. Some relief is
provided by the emergence of solutions like Hive or Pig that can provide easier
access to data held in Hadoop clusters and perhaps, eventually, other
NoSQL databases.


(iv) Administration - The main goal for NoSQL may be to provide
a zero admin solution, but the current reality falls well short of that goal.
NoSQL today requires a lot of skill to install as well as a lot of effort to
maintain.

(v) Expertise - Almost every NoSQL developer is in a learning mode.
This situation will resolve naturally over time, but for now, it is far easier to
find experienced RDBMS programmers or administrators than a NoSQL expert.
Q.22. Discuss about the variations of NoSQL architectural patterns.

Ans. The variations of NoSQL architectural patterns are discussed
below -

(i) RAM or SSD Stores - A key-value store that only uses RAM is
called a RAM cache. It is flexible and has general tools that application
developers can use to store global variables, configuration files, or
intermediate results. A RAM cache is fast and reliable, and
can be thought of as another programming construct like an array, a map or
a lookup system.

RAM resident key-value stores are generally empty when the
server starts up and can only be populated with values on demand. RAM
resident information must be saved to another storage system if we want
it to persist between server restarts. We need to define the rules about how
memory is partitioned between the RAM cache and the rest of our
application.

SSD systems provide permanent storage and are almost as fast as RAM
for read operations. The Amazon DynamoDB key value store service uses
SSDs for all its storage, resulting in high-performance read operations. Write
operations to SSDs can often be buffered in large RAM caches, resulting in
fast write times until the RAM becomes full.

(ii) Distributed Stores - The ability to elegantly and transparently
scale to a large number of processors is a core property of most NoSQL systems.
Ideally the process of data distribution is transparent to the user, meaning
the API doesn't require you to know how or where your data is stored. But
knowing that your NoSQL software can scale and how it does this is critical
in the software selection process.

If our application uses many web servers, each caching the result of a
long-running query, it is most efficient to have a method that allows the servers
to work together to avoid duplication. This mechanism is known as memcache.
The memcache protocol shows that we can create simple communication
protocols between distributed systems to make them work efficiently as a
group. This type of information sharing can be extended to other NoSQL data
architectures such as column stores (bigtable stores) and document stores.

We can generalize the key-value pair to other patterns by referring to them as
cached items.

The cached items need to be replicated automatically on multiple servers
to provide a data service without interruption. If the cached items are
stored on two servers and the first one becomes unavailable, the second server
can quickly return the value without waiting for the first server to be rebooted
or restored from backup.
(iii) Grouping - The implementation of a collection system can
vary dramatically based on what NoSQL data pattern we use. Key-value stores
have several methods to group similar items based on attributes in their keys.
Graph stores associate one or more group identifiers with each triple. Big data
systems use column families to group similar columns. Document stores use
a concept of a document collection.
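
The replication of cached items across servers, described under distributed stores, can be sketched as follows. This is a toy in-process model, not the memcache protocol. `CacheServer`, `ReplicatedCache` and the byte-sum placement rule are assumptions chosen only to keep the example deterministic.

```python
class CacheServer:
    """One cache node holding items in RAM."""

    def __init__(self, name):
        self.name = name
        self.items = {}
        self.available = True


class ReplicatedCache:
    """Stores every cached item on `replicas` servers."""

    def __init__(self, servers, replicas=2):
        self.servers = servers
        self.replicas = min(replicas, len(servers))

    def _owners(self, key):
        # Deterministic placement: pick a starting server from the key's
        # bytes, then take the following servers as replicas.
        start = sum(key.encode("utf-8")) % len(self.servers)
        return [self.servers[(start + i) % len(self.servers)]
                for i in range(self.replicas)]

    def put(self, key, value):
        for server in self._owners(key):
            if server.available:
                server.items[key] = value

    def get(self, key):
        # Try each replica in turn; an unavailable server is skipped
        # instead of waiting for it to be rebooted or restored.
        for server in self._owners(key):
            if server.available and key in server.items:
                return server.items[key]
        return None
```

With two replicas, marking the first owner of a key as unavailable still lets `get` answer from the second owner, which is the behaviour the text describes.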


Q.23. Discuss various ways in which NoSQL handles big data.


Ans. The four most popular ways in which NoSQL systems handle and
manage big data obstacles are as follows -


(i) By Moving Queries to the Data - When a client wants to send
a general query to all nodes that hold data, it is more efficient to send the query
to each node than it is to transfer large datasets to a central processor.


This simple rule helps us to understand how NoSQL databases can have
dramatic performance advantages over systems that were not designed to
distribute queries to the data nodes. Consider an RDBMS that has tables
distributed over two different nodes. In order for the SQL query to work,
information about rows on one table must all be moved across the network to
the other node. Larger tables result in more data movement, which results in
slower queries. Think of all the steps involved. The tables must be extracted,
serialized, sent through the network interface, transmitted over networks,
reassembled and then compared on the server with the SQL query. Keeping all
the data within each data node in the form of logical documents means that
only the query itself and the final result need to be moved over a network. This
keeps our big data queries fast.


(ii) Using Hash Ring to Evenly Distribute the Load - Using a hash
ring technique to distribute big data loads over many servers with a
randomly generated 40-character key is a good way to evenly distribute a
network load. Hash rings are common in big data solutions because they
consistently determine how to assign a piece of data to a specific processor.
These rings take the leading bits of a document's hash value and use them to
determine which node the document should be assigned to. This allows any
node in a cluster to know what node the data lives on and how to adapt to new
assignment methods as our data grows. Partitioning keys into ranges and
assigning different key ranges to specific nodes is known as keyspace
management. Most NoSQL systems, including MapReduce, use keyspace
concepts to manage distributed computing problems.
(iii) Using Replication to Scale Reads - The replication strategy
works well in most cases. Only a few times is there some lag time between a
write to the read/write node and a client reading of that same record from a
replica. If a client does a write and then an immediate read from that same
node, there is no problem. The problem occurs if a read occurs from a replica
node before the update happens. This is an example of an inconsistent read.
The best way to avoid this type of problem is to only allow reads to the
same write node after a write has been done. This logic can be added to a
session or state management system at the application layer. Almost all
distributed databases relax database consistency rules when a large number of
nodes permit writes.

(iv) Allowing the Database to Distribute Queries Evenly to Data
Nodes - Fig. 4.8 shows the approach of moving the query to the data rather
than moving the data to the query. This is an important part of NoSQL big
data strategies. In this instance, moving the query is handled by the database
server, and distribution of the query and waiting for all nodes to respond is
the only responsibility of the database, not the application layer.
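
The hash ring idea in point (ii) can be illustrated with a short sketch. It assumes SHA-1 as the hash (its hexadecimal digest happens to be 40 characters long) and a simple round-robin assignment of key ranges to nodes. The function names and the choice of four leading bits are hypothetical, and real systems usually add refinements such as virtual nodes that are not shown here.

```python
import hashlib


def document_key(document_text):
    """Return a 40-character hexadecimal SHA-1 digest of the document."""
    return hashlib.sha1(document_text.encode("utf-8")).hexdigest()


def build_ring(node_names, bits=4):
    """Split the key space into 2**bits equal ranges.

    The ranges are dealt out to the nodes round-robin; the returned list
    is indexed by range number.
    """
    return [node_names[i % len(node_names)] for i in range(2 ** bits)]


def node_for(key_hex, ring, bits=4):
    """Use the leading `bits` bits of the 160-bit hash to pick a range."""
    leading = int(key_hex, 16) >> (160 - bits)
    return ring[leading]
```

Any node holding the same `ring` list can compute `node_for(document_key(text), ring)` on its own and arrive at the same answer, which is what lets every node know where a piece of data lives.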


[Fig. 4.8: groups of primary data nodes, each with its replica data nodes.]
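
Moving the query to the data, as described in points (i) and (iv), can be sketched as a scatter-gather step. In this illustration threads stand in for data nodes and the sample documents are invented; `run_on_node` and `distributed_total` are hypothetical names rather than part of any database API.

```python
from concurrent.futures import ThreadPoolExecutor

# Each data node holds its own documents (hypothetical sample data).
data_nodes = {
    "node1": [{"city": "Pune", "amount": 120}, {"city": "Delhi", "amount": 80}],
    "node2": [{"city": "Pune", "amount": 40}],
    "node3": [{"city": "Delhi", "amount": 10}, {"city": "Pune", "amount": 5}],
}


def run_on_node(documents, city):
    """Runs where the data lives and returns only a small partial result."""
    return sum(d["amount"] for d in documents if d["city"] == city)


def distributed_total(city):
    """Send the query to every node, wait for all of them, then combine."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda docs: run_on_node(docs, city),
                            data_nodes.values())
        return sum(partials)
```

Only the query parameter (`city`) and one number per node cross the boundary between the coordinator and the nodes; the documents themselves never move.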


INTRODUCTION TO MONGODB 


Q.24. What is MongoDB ?


Ans. MongoDB is a famous NoSQL database that is an open source,
cross-platform, high performance database. MongoDB uses collections to store
data and to represent relationships between them, and data is in the format of
BSON documents. BSON is a binary format in which zero or more key/value pairs
are stored as a single entity, i.e. as a document. BSON is based on JSON
documents. JSON (JavaScript Object Notation) is a format that is easy for
computers to parse and generate. MongoDB works on the concept of collection
and document.


Database is a physical container for collections. Each database gets its
own set of files on the file system. A single MongoDB server typically has
multiple databases.


Collection is a group of MongoDB documents. It is the equivalent of an
RDBMS table. A collection exists within a single database. Collections do not
enforce a schema. Documents within a collection can have different fields.
Typically, all documents in a collection are of similar or related purpose.


Document is a set of key-value pairs. Documents have dynamic schema.
Dynamic schema means that documents in the same collection do not need to
have the same set of fields or structure, and common fields in a collection's
documents may hold different types of data.
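
The database, collection and document hierarchy, and the meaning of a dynamic schema, can be illustrated without a running server. The sketch below uses plain Python dictionaries as stand-ins for BSON documents; `insert` and `find` are hypothetical helpers written for this example and are not the MongoDB driver API.

```python
# database -> collection name -> list of documents
database = {"posts": []}


def insert(collection, document):
    """Add a document to a collection and return its identifier."""
    doc = dict(document)
    # Simplistic counter-based id, an assumption of this sketch only.
    doc.setdefault("_id", len(database[collection]) + 1)
    database[collection].append(doc)
    return doc["_id"]


def find(collection, **criteria):
    """Return the documents whose fields equal all the given criteria."""
    return [doc for doc in database[collection]
            if all(doc.get(field) == value
                   for field, value in criteria.items())]
```

Documents with different fields, or with the same field holding different types, can sit side by side in one collection, for example `insert("posts", {"title": "A", "tags": ["x"]})` followed by `insert("posts", {"title": 3, "rating": 4.5})`.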


Q.25. What are the features of MongoDB ?

Ans. MongoDB features include full index support, replication, high
availability, and auto-sharding. Some important features of MongoDB are
discussed below -

(i) Indexing - MongoDB supports secondary indexing, which makes
retrieval faster. Unique, compound and geospatial indexing are also possible.

(ii) Stored JavaScript - Users can also use JavaScript functions as
well as scripts on the server side.

(iii) Aggregation - MongoDB supports MapReduce, which is a very
useful aggregation tool.

(iv) Horizontal Scaling - MongoDB scales horizontally. It scales
out and up easily on a variety of platforms, including in the cloud using services
like Amazon EC2 and Rackspace.

(v) Sharding - This is a process in which large databases are broken
down into smaller partitions (shards) so that they can be processed on multiple
machines, and in MongoDB this is automatic. MongoDB's sophisticated sharding
keys make balancing your data across large clusters easy and powerful.


Q.26. How MongoDB used in BlogPost ?


Ans. Using MongoDB database, our blog posts can be stored in a single
collection, with each entry looking like this -

{
   ...
   Location : [1212322, 42.122322],
   Rating : 1.1,
   Comments : [
      {User : "rakesh@hotmail.com",
       Upvotes : 11,
       Downvotes : 4,
       Text : "I agree with you"},
      {User : "dolly@gmail.com",
       Upvotes : 321,
       Downvotes : 11,
       Text : "You are a man"}
   ],
   Tags : ["Politics", "Virginia"]
}

With a document type database, data is stored almost exactly as it is
represented in the program.


Q.27. Give advantages and disadvantages of MongoDB.

Ans. Advantages of MongoDB -
(i) Schema-less (without schema) design enables rapid introduction
of new CDR (Call Detail Record) types to the system.
(ii) Scale - BillRun production site already controls several TB in a
single table, without being limited by adding new fields.
(iii) Rapid replica set easily enables meeting regulation with easy to
setup multi data center DRP as well as HA solution.
(iv) Sharding enables linear as well as scale out growth without
running out of budget.
(v) With over approximately 2,000/s CDR inserts, the
architecture is great for a system that must support high insert load. Yet you
can still guarantee transactions with findAndModify as well as two-phase
commit.
(vi) Supports developer oriented queries.
(vii) Location based is being utilized to analyze users' usage as well
as determining where to invest in cellular infrastructure.
Disadvantages of MongoDB -

(i) The current database is locked when MongoDB is writing on
it; therefore this does not allow concurrent writes.

(ii) MongoDB reports scalability constraints when the data exceeds
hundreds of GB.

Q.28. Give reasons to use MongoDB.

Ans. Few of the reasons to use MongoDB are as follows -

(i) Document-oriented - Since MongoDB is a NoSQL type
database, instead of having data in a relational type format it stores
the data in documents. This makes MongoDB very flexible and adaptable to
real business world situations and requirements.

(ii) Ad hoc Queries - MongoDB supports searching by field, range
queries, and regular expression searches. Queries can be made to return specific
fields within documents.

(iii) Indexing - Indexes can be created to improve the performance
of searches within MongoDB. Any field in a MongoDB document can be
indexed.

(iv) Replication - MongoDB can provide high availability with replica
sets. A replica set consists of two or more MongoDB instances. Each replica
set member may act in the role of the primary or secondary replica at any
time. The primary replica is the main server which interacts with the client
and performs all the read/write operations. The secondary replicas maintain a
copy of the data of the primary using built-in replication. When a primary
replica fails, the replica set automatically switches over to the secondary and
then it becomes the primary server.

(v) Load Balancing - MongoDB uses the concept of sharding to
scale horizontally by splitting data across multiple MongoDB instances.
MongoDB can run over multiple servers, balancing the load and/or duplicating
data to keep the system up and running in case of hardware failure.

Q.29. Discuss about the MongoDB datatypes.

Ans. MongoDB supports following datatypes -

(i) String - This is the most commonly used datatype to store the
data. String in MongoDB must be UTF-8 valid.
(ii) Integer - This type is used to store a numerical value. Integer
can be 32 bit or 64 bit depending upon your server.
(iii) Boolean - This type is used to store a Boolean (true/false) value.
(iv) Double - This type is used to store floating point values.
(v) Min/Max Keys - This type is used to compare a value against
the lowest and highest BSON elements.
(vi) Arrays - This type is used to store arrays or list or multiple
values into one key.
(vii) Timestamp - This can be handy for recording when a document
has been modified or added.
(viii) Object - This datatype is used for embedded documents.
(ix) Null - This type is used to store a Null value.
(x) Symbol - This datatype is used identically to a string; however,
it is generally reserved for languages that use a specific symbol type.
(xi) Date - This datatype is used to store the current date or time in
Unix time format. We can specify our own date time by creating an object of
Date and passing day, month, year into it.
(xii) Object ID - This datatype is used to store the document's ID.
(xiii) Binary Data - This datatype is used to store binary data.
(xiv) Code - This datatype is used to store JavaScript code into
document.
(xv) Regular Expression - This datatype is used to store regular
expression.

Q.30. Give advantages of MongoDB over RDBMS.

Ans. Advantages of MongoDB over RDBMS are as follows -

(i) Schema less : MongoDB is a document database in which one
collection holds different documents. Number of fields, content and size of the
document can differ from one document to another.
(ii) Structure of a single object is clear.
(iii) No complex joins.
(iv) Deep query-ability. MongoDB supports dynamic queries on
documents using a document-based query language that is nearly as powerful
as SQL.
(v) Tuning.
(vi) MongoDB is easy to scale.
(vii) Conversion/mapping of application objects to database objects
not needed.
(viii) Uses internal memory for storing the (windowed) working set,
enabling faster access of data.
Q.31. What are the differences between MongoDB and RDBMS ?


Ans. The differences between MongoDB and RDBMS are as follows -

RDBMS | MongoDB | Difference
Table | Collection | In RDBMS, the table contains the columns and rows which are used to store the data. In MongoDB, this same structure is known as a collection. The collection contains documents, which in turn contain Fields, which in turn are key-value pairs.
Row | Document | In RDBMS, a row represents a single data item in a table. In MongoDB, the data is stored in documents.
Column | Field | In RDBMS, the column denotes a set of data values. These in MongoDB are known as Fields.
Joins | Embedded documents | In RDBMS, data is sometimes spread across various tables, and in order to show a complete view of all data, a join is sometimes formed across tables to get the data. In MongoDB, the data is normally stored in a single collection, but separated by using embedded documents. So there is no concept of joins in MongoDB.

MINING SOCIAL NETWORK 
GRAPHS 


APPLICATIONS OF SOCIAL NETWORK MINING, SOCIAL NETWORKS AS A GRAPH,
TYPES OF SOCIAL NETWORKS


Q.1. What is social network ? Explain.

Ans. Social network is a term used to describe web-based services that
allow individuals to create a public/semi-public profile within a domain such
that they can communicatively connect with other users within the network.
Social network has improved on the concept and technology of Web 2.0, by
enabling the formation and exchange of User-Generated Content. Simply put, a
social network is a graph consisting of nodes and links used to represent
social relations on social network sites. The nodes include entities and the
relationships between them form the links, as shown in fig. 5.1.


[Figure: a small graph with one circle labelled "Node" and one connecting
line labelled "Link".]

Fig. 5.1 Social Network Showing Nodes and Links
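
The view of a social network as nodes and links can be made concrete with a small adjacency-set sketch. The class name `SocialGraph`, its methods and the sample member names used in the note below are all hypothetical; links are treated as undirected, which is an assumption of this example.

```python
from collections import defaultdict


class SocialGraph:
    """Undirected graph: nodes are members, links are their relationships."""

    def __init__(self):
        self.links = defaultdict(set)  # node -> set of linked nodes

    def add_link(self, a, b):
        self.links[a].add(b)
        self.links[b].add(a)

    def degree(self, node):
        """Number of links attached to a node."""
        return len(self.links.get(node, ()))

    def mutual(self, a, b):
        """Nodes linked to both a and b."""
        return self.links.get(a, set()) & self.links.get(b, set())
```

After adding links such as asha-ravi, asha-meena and ravi-meena, `mutual("asha", "ravi")` returns the members connected to both, a typical building block for the kind of analysis discussed in this unit.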


Social networks are important sources of online interactions and contents
sharing, subjectivity, assessments, approaches, evaluation, influences,
observations, feelings, opinions and sentiments expressions borne out in text,
reviews, blogs, discussions, news, remarks, reactions, or some other
documents. Before the advent of social network, the homepages were popularly
used in the late 1990s, which made it possible for average internet users to
share information. However, the activities on social network in recent times
seem to have transformed the World Wide Web (www) into its intended original
creation. Social network platforms enable rapid information exchange between
users regardless of the location. Many organisations, individuals and even
governments of countries now follow the activities on social network. The
network enables big organisations, celebrities, government officials and
government bodies to obtain knowledge on how their audience reacts to
postings that concern them, out of the enormous data generated on social
network.

Social networks need not be social in context. There are many real-world
instances of technological, business, economic, and biologic social networks.
Examples include electrical power grids, telephone call graphs, the spread of
computer viruses, the World Wide Web, and citation networks. Customer
networks and collaborative filtering problems (where product
recommendations are made based on the preferences of other customers) are
other examples. In biology, examples range from epidemiological networks,
cellular and metabolic networks, and food webs, to the neural network of the
nematode worm Caenorhabditis elegans (the only creature whose neural network
has been completely mapped). The exchange of e-mail messages within
corporations, newsgroups, chat rooms, friendships, sex webs (linking sexual
partners), and the quintessential "old-boy" network (i.e., the overlapping boards
of directors of the largest companies in the United States) are examples from
sociology.
Q.2. Write short note on history of social network analysis.

Ans. Social network analysis (SNA) emerged from the social sciences
branch as a method for studying why and how social groups operate, behave,
and interact in certain ways. A social network is comprised of individuals or
organizations connected by kinship, friendship, beliefs, common interests,
financial exchange and many other things. These social networks are
represented as a graph where the entities form the nodes and the edges are the
connections between them. Social networks operate at many levels, from within
families to nation-wide and also across nations, and play a crucial role in
influencing the behaviour of entities within the network.

Georg Simmel, writing at the turn of the twentieth century, was the first
scholar to think directly in social network terms. His essays focused on the
nature of network size on interaction and on the likelihood of interaction in
loosely-coupled networks rather than groups.

In the twentieth century, three main traditions in social network analysis
appeared. In the 1930s, J.L. Moreno pioneered the systematic recording and
analysis of social relationships in small groups, namely in work groups and
class rooms. A group led by W. Lloyd Warner and Elton Mayo explored
interpersonal relations at work. In 1940, A.R. Radcliffe-Brown's presidential
address to British anthropologists stressed the systematic study of networks.
In the 1960s and 1970s, an increasing number of researchers worked to combine
different traditions. Significant independent work was also done by scholars
at the University of California, University of Chicago, and Michigan State
University. They critiqued individualism and group-based analyses, stating
that viewing the world as social networks offered more analytic advantage.

Social network analysis is an analytical tool. It just provides users with a
result that may require further analysis. SNA can be effective when combined
with human analytical discretion, but as it is it can be incomplete unless the
wider context of these networks is considered.


Q.3. Describe the essential characteristics of a social network.

Ans. The essential characteristics of a social network are as follows —

(i) There is a collection of entities that participate in the network.
Typically, these entities are people, but they could be something else entirely.

(ii) There is at least one relationship between entities of the network.
On Facebook or its ilk, this relationship is called friends. Sometimes the
relationship is all-or-nothing; two people are either friends or they are not.
However, in other examples of social networks, the relationship has a degree.
This degree could be discrete, e.g., friends, family, acquaintances, or none as
in Google+. It could be a real number; an example would be the fraction of the
average day that two people spend talking to each other.

(iii) There is an assumption of nonrandomness or locality. This
condition is the hardest to formalize, but the intuition is that relationships
tend to cluster. That is, if entity A is related to both B and C, then there is
a higher probability than average that B and C are related.


Q.4. Describe basic terminology of social network.

Ans. The basic terminology of social network is as follows —

(i) Betweenness — Betweenness of a node measures the number of
paths that pass through each individual. This can identify the nodes which
have the ability to control the flow of information between different parts of
the network. These can be called gateway nodes. Gateway nodes channel
information to most of the others in the network if they have many paths
running through them. Even if they have only a few paths running through
them, they still play a powerful communication role if they lie between
different clusters of the network.

(ii) Centrality — Centrality is a key term in SNA. A highly centralized
network is dominated by one individual who controls the information and
knowledge flow and may become a hub of communication failure. A less
centralized network has no single point of failure, so people can still pass
on information even if some channels are blocked.

(iii) Degree — Degree of a node specifies the number of links to
other individuals in the network. The higher the degree of a node, the more
influential it is within the network.

(iv) Closeness — Closeness measures the extent to which an individual
is near to all other individuals in a network, either directly or indirectly.
It exhibits the ability to access information through the network members.
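
The degree and closeness measures described above can be illustrated with a minimal sketch. The five-person graph and the function names are hypothetical, and closeness is taken here as (n - 1) divided by the sum of hop distances, which is one common convention and assumes a connected graph. Betweenness needs shortest-path counting and is left out of this sketch.

```python
from collections import deque

# Hypothetical undirected friendship graph (adjacency sets); not from the text.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B", "E"},
    "E": {"D"},
}

def degree(g, node):
    """Number of links from `node` to other individuals."""
    return len(g[node])

def hop_distances(g, source):
    """Breadth-first search: shortest hop count from `source` to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in g[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def closeness(g, node):
    """(n - 1) / sum of distances to all other nodes; assumes a connected graph."""
    dist = hop_distances(g, node)
    return (len(g) - 1) / sum(dist.values())

for name in sorted(graph):
    print(name, degree(graph, name), round(closeness(graph, name), 3))
```

In this toy graph B has both the highest degree and the highest closeness, which matches the intuition that a well-connected individual reaches everyone else in few steps.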

Q.5. Give general applications of social network analysis. 
Ans. The general applications of SNA are as follows — 


(i) For improved customer targeting, for potential promotions based
on their past purchase history.

(ii) In identifying loyal customers who are vocal, active and passionate
and can be characterized as brand ambassadors.

(iii) In reducing average churn rates in the telecommunications
industry by identifying central connectors and offering special rewards or
customized experiences.

(iv) In combating terrorist activities by characterizing the network
organizations to determine the likelihood and impact of terrorist activity.

(v) In detecting health care fraud by detecting patterns, establishing
linkage between individuals, and connecting non-obvious relationships.


Q.6. Write short note on social network as a graph. 

Ans. A social network is conceptualized as a graph, that is, a set of vertices
(or nodes, units, points) representing social entities or objects and a set of
lines representing one or more social relations among them. A network, however,
is more than a graph because it contains additional information on the vertices
and lines.

Formally, a network N can be defined as N = (U, L, F_U, F_L) containing a
graph G = (U, L), which is an ordered pair of a unit or vertex set U and a line
set L, extended with a function F_U specifying a vector of properties of the
units (f: U → X) and a function F_L specifying a vector of properties of the
lines (f: L → Y). The set of lines L may be regarded as the union of a set of
undirected edges E and a set of directed arcs A (L = E ∪ A). Each element e of
E (each edge) is an unordered pair of units u and v (vertices) from U, i.e.
e(u: v), and each element a of A (each arc) is an ordered pair of units u and
v (vertices) from U, i.e. a(u: v).

Based on the contents of the nodes, networks can be divided into two types
as follows —

(i) Social and Economic Network — It consists of a group of people
connected with some sort of interactions or pattern of communication, e.g.,
Facebook, business relations between companies and clients, relationships
between families involved in a marriage etc.

(ii) Information Network — The connection between information
objects, e.g., Semantic (links between various words and symbols), World Wide
Web (links between various web pages; one page connecting to another through
hyperlinks).

Q.7. Describe the varieties of social network.

Ans. There are many varieties of social networks other than “friends”
networks, such as —

(i) Telephone Networks — In these networks the nodes represent
phone numbers, which are really individuals. There is an edge between two
nodes if a call has been placed between those phones in some fixed period of
time, such as last month, or “ever”. The edges could be weighted by the
number of calls made between these phones during the period. Communities
in a telephone network will form from groups of people that communicate
frequently. For example, groups of friends, members of a club, or people
working at the same company.

(ii) Email Networks — In these networks the nodes represent email
addresses, which are again individuals. An edge represents the fact that there
was at least one email in at least one direction between the two addresses.
Alternatively, we may only place an edge if there were emails in both directions.
In that way, we avoid viewing spammers as “friends” with all their victims.
Another approach is to label edges as weak or strong. Strong edges represent
communication in both directions, while weak edges indicate that the
communication was in one direction only. The communities seen in email
networks come from the same sorts of groupings as in case of telephone
networks. A similar sort of network involves people who text other people
through their cell phones.

(iii) Collaboration Networks — Here nodes represent individuals who
have published research papers. There is an edge between two individuals
who published one or more papers jointly. Optionally, we can label edges by
the number of joint publications. The communities in this network are authors
working on a particular topic.

An alternative view of the same data is as a graph in which the nodes are
papers. Two papers are connected by an edge if they have at least one author
in common. Now, we form communities that are collections of papers on the
same topic.

There are several other kinds of data that form two networks in a similar
way. For example, we can look at the people who edit Wikipedia articles and
the articles that they edit. Two editors are connected if they have edited an
article in common. The communities are groups of editors that are interested
in the same subject. Dually, we can build a network of articles, and connect
articles if they have been edited by the same person. Here, we get communities
of articles on similar or related subjects.
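
The idea that one data set yields two dual networks can be sketched as follows. The editing records and the helper name `project` are hypothetical; the sketch only connects two editors (or two articles) when they share at least one article (or editor), as described for the Wikipedia example.

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical editing records: editor -> set of articles edited.
edits = {
    "ed1": {"Graph theory", "Clustering"},
    "ed2": {"Graph theory"},
    "ed3": {"Clustering", "Cricket"},
}

def project(membership):
    """Connect two keys whenever their value sets share at least one element."""
    edges = set()
    for a, b in combinations(sorted(membership), 2):
        if membership[a] & membership[b]:
            edges.add((a, b))
    return edges

# Invert the mapping to get article -> set of editors.
editors_of = defaultdict(set)
for editor, articles in edits.items():
    for article in articles:
        editors_of[article].add(editor)

editor_network = project(edits)        # editors linked by a common article
article_network = project(editors_of)  # articles linked by a common editor
print(editor_network)
print(article_network)
```

The same function builds both networks, because the only difference between them is which side of the editor-article relation plays the role of the node.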


Q.8. How can we use social network analysis in health system ? 


Ans. Social networks have a tremendous influence on the health behaviour 
of individuals. The results from social network analysis can be used by the 
government for designing health plans, benefits and to take preventive measures 
during some disease outbreaks. Pharmaceutical companies can target 
demographic groups and specific markets. Health insurance companies can 
design their insurance plans in a better way. 


The health behaviour of people in a specific location is analyzed. One of 
the two graph search algorithms namely Breadth First Search (BFS) and Depth 


First Search (DFS) can be used to get the sub graph L(G) of a specified 
location from the main graph S(G). The graph L(G) can further be narrowed 
down to extract the sub graphs of health communities (HC(G)), health blogs 
(HB(G)), and users in the location (U(G)). Fig. 5.2 depicts the above process. 


[Figure: S(G) is reduced to L(G) by the BFS algorithm, and L(G) is reduced by
the BFS algorithm to HC(G), HB(G) and U(G)]

Fig. 5.2 Narrowing Down the Initial Graph using Breadth
First Search Algorithm

Using a web based crawler, the number of users in these graphs is identified
and categorized under various categories such as Healthy Behaviour and
Preventive Health Behaviour, and the percentage of each category in that
specific location is calculated. These results can be used to design health
policy plans for different types of diseases. They can also be used by NGOs to
identify locations where health programs could be planned and conducted, and
to identify the right audience for preventive health programs.
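
The narrowing step from S(G) to L(G) described above can be sketched with a breadth-first search that only follows nodes in the chosen location. The graph, the location labels and the function name are hypothetical, and the sketch assumes the start node itself lies in the location of interest; users who can only be reached through people elsewhere would be missed by this simple version.

```python
from collections import deque

# Hypothetical main graph S(G): adjacency lists plus a location label per node.
S = {
    "u1": ["u2", "u3"],
    "u2": ["u1", "u4"],
    "u3": ["u1"],
    "u4": ["u2", "u5"],
    "u5": ["u4"],
}
location = {"u1": "Bhopal", "u2": "Bhopal", "u3": "Indore",
            "u4": "Bhopal", "u5": "Indore"}

def bfs_subgraph(graph, start, keep):
    """BFS from `start`, visiting only nodes for which keep(node) is true."""
    visited = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in visited and keep(v):
                visited.add(v)
                queue.append(v)
    # Induced subgraph on the visited nodes.
    return {u: [v for v in graph[u] if v in visited] for u in visited}

L = bfs_subgraph(S, "u1", lambda node: location[node] == "Bhopal")
print(sorted(L))
```

Applying the same function again with a different `keep` test (for example, membership of a health community) would give the further narrowing to HC(G), HB(G) or U(G).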


Ans. The types of social networks are as follows —

(i) One Mode Networks — A one mode network considers relations
only among one set of nodes. For example, if we use demographic features to
identify similar nodes and combine similar nodes based on this, the resulting
network can be treated as a one mode network. Mostly, networks are considered
as one mode networks with one set of nodes that are similar to each other.
Fig. 5.3 (a) illustrates a one mode network consisting of a group of
scientists that share similar features or characteristics.

Fig. 5.3 (a) One Mode Network Consisting of a Cluster of Scientists

(ii) Two Mode Networks — It considers relations among two sets
of nodes. In community detection a node or actor may have strong ties
with a set of nodes that form a community, but need to be analyzed with other
sets of nodes with which it is considered to have weak ties. One classic
example of two mode network is the scientific collaboration network, in which
the two sets of nodes are scientists and papers. A scientist is connected to a
paper only if he/she is an author of that paper. Fig. 5.3 (b) illustrates a
two mode network consisting of a group of scientists that share some common
papers, being the authors of those papers.

Fig. 5.3 (b) Two Mode Network Consisting of a Cluster of Scientists

(iii) Complete/Whole Networks — These networks comprise a
network that contains associations among members of a single, enclosed
community of nodes. For example, if we consider the relational ties among
all of the experts of a social network for a particular topic, it is considered
as a complete network. This kind of network focuses on the pattern of
connections in the network as a whole rather than on individual nodes.
Fig. 5.3 (c) illustrates a complete network consisting of a group of nodes U1
to U10 and the connections among all the nodes as a whole.

Fig. 5.3 (c) Complete Network Consisting of a Cluster of Ten Nodes

(iv) Ego Networks — Ego or ego-centric networks comprise a
network that concentrates mainly on the connections directly associated with
the focal actor (called the ego). For example, if we initially choose a few
nodes from a social network that can be considered as trusted nodes, these
nodes will serve as egos in the network. Then, the egos of the network can be
further studied to generate more trusted nodes in the social network. Ego
networks are considered to be homophilous, i.e., ego nodes will have the
strongest ties with those nodes that are similar to ego nodes in terms of
key attributes such as age, gender, political views, occupation, etc.
Fig. 5.3 (d) illustrates an ego network consisting of an ego node E and all
other nodes (A1 to A5) referred to as alters, who are connected to ego node
E. The dashed lines indicate the connections between alter nodes.

Fig. 5.3 (d) Ego Network Consisting of an Ego Node E and a Cluster of
Alters A1 to A5

CLUSTERING OF SOCIAL GRAPHS, DIRECT DISCOVERY OF
COMMUNITIES IN A SOCIAL GRAPH, INTRODUCTION
TO RECOMMENDER SYSTEM

Q.10. Discuss the clustering of social network graphs.

Ans. An important aspect of social networks is that they contain
communities of entities that are connected by many edges. These typically
correspond to groups of friends at school or groups of researchers interested
in the same topic. Let us consider clustering of the graph as a way to identify
communities. The first step is to define distance measures for social network
graph and then apply standard clustering methods.


Distance Measures for Social Network Graphs — Before applying standard
clustering techniques to a social network graph, the first step is to define a
distance measure. When the edges of the graph have labels, these labels might
be usable as a distance measure, depending on what they represent. But when
the edges are unlabeled, as in a “friends” graph, there is not much we can do
to define a suitable distance.

Our first instinct is to assume that nodes are close if they have an edge
between them and distant if not. Thus, we could say that the distance d(m, n)
is 0 if there is an edge (m, n) and 1 if there is no such edge. We could use
any other two values, such as 1 and ∞, as long as the distance is closer when
there is an edge.

Neither of these two-valued “distance measures”, 0 and 1 or 1 and ∞, is a
true distance measure. The reason is that they violate the triangle inequality
when there are three nodes, with two edges between them. That is, if there are
edges (X, Y) and (Y, Z), but no edge (X, Z), then the distance from X to Z
exceeds the sum of the distances from X to Y to Z. We could fix this problem
by using, say, distance 1 for an edge and distance 1.5 for a missing edge.

Applying Standard Clustering Methods — There are two general
approaches to clustering. The first is hierarchical (agglomerative) and second
is point-assignment.

Hierarchical clustering of a social network graph starts by combining
some two nodes that are connected by an edge. Successively, edges that are
not between two nodes of the same cluster would be chosen randomly to
combine the clusters to which their two nodes belong. The choices would be
random, because all distances represented by an edge are the same.
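
The triangle-inequality argument for two-valued distances can be checked mechanically. The sketch below uses the three nodes X, Y, Z with edges (X, Y) and (Y, Z) and no edge (X, Z); the helper names are illustrative only.

```python
from itertools import permutations
import math

nodes = ["X", "Y", "Z"]
edges = {frozenset(("X", "Y")), frozenset(("Y", "Z"))}  # no edge (X, Z)

def make_distance(edge_value, missing_value):
    """Two-valued distance: one value for an edge, another for a missing edge."""
    def d(a, b):
        if a == b:
            return 0
        return edge_value if frozenset((a, b)) in edges else missing_value
    return d

def satisfies_triangle_inequality(d):
    """True if d(a, c) <= d(a, b) + d(b, c) for every ordered triple of nodes."""
    return all(d(a, c) <= d(a, b) + d(b, c)
               for a, b, c in permutations(nodes, 3))

for edge_value, missing_value in [(0, 1), (1, math.inf), (1, 1.5)]:
    ok = satisfies_triangle_inequality(make_distance(edge_value, missing_value))
    print(edge_value, missing_value, ok)
```

Only the pair (1, 1.5) passes, since 1.5 does not exceed 1 + 1, whereas 1 exceeds 0 + 0 and ∞ exceeds 1 + 1.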


For Example — Consider the graph of fig. 5.4. First, let us specify the
communities. At the highest level, it appears that there are two communities
{X, Y, Z} and {M, N, O, P}. However, we could also view {M, N, O} and
{M, O, P} as two subcommunities of {M, N, O, P}. These two subcommunities
overlap in two of their members, and thus could never be identified by a pure
clustering algorithm. Finally, we could consider each pair of individuals that
are connected by an edge as a community of size 2, although such communities
are uninteresting.

Fig. 5.4 Social Network Graph

The problem with hierarchical clustering of a graph like that of fig. 5.4
is that at some point we are likely to choose to combine Y and M, even though
they surely belong in different clusters. The reason we are likely to combine
Y and M is that M, and any cluster containing it, is as close to Y and any
cluster containing it, as X and Z are to Y. There is even a 1/9 probability
that the first thing we do is to combine Y and M into one cluster.

To reduce the probability of error, we can run hierarchical clustering
several times and pick the run that gives the most coherent clusters. Even so,
in a large graph with many communities there is a significant chance that in
the initial phases we shall use some edges that connect two nodes that do not
belong together in any large community.

The point-assignment approach to clustering social networks suffers from
the same problem. Because all edges are at the same distance, a number of
random factors are introduced that will lead to some nodes being assigned to
the wrong cluster.

Q.11. Discuss the direct discovery of communities in a social graph.

Ans. Searching communities by partitioning all the individuals is relatively
efficient, however it does have several limitations. It is not possible to
place an individual in two different communities, and everyone is assigned to a
community. Let us consider a technique for discovering communities directly
by looking for subsets of the nodes that have a relatively large number of
edges among them.

Finding Cliques — To find sets of nodes with many edges between
them we have to start by finding a large clique (a set of nodes with edges
between any two of them). However, that task is not easy. Not only is
finding maximal cliques NP-complete, but it is among the hardest of the NP-
complete problems in the sense that even approximating the maximal clique
is hard. Further, it is possible to have a set of nodes with almost all edges
between them, and yet have only relatively small cliques. For example,
suppose our graph has nodes numbered 1, 2, ..., n and there is an edge
between two nodes i and j unless i and j have the same remainder when
divided by k. Then the fraction of possible edges that are actually present is
approximately (k − 1)/k. There are many cliques of size k, of which
{1, 2, ..., k} is but one example.

Yet there are no cliques larger than k. To see why, observe that any set of
k + 1 nodes has two that leave the same remainder when divided by k. This
point is an application of the “pigeonhole principle”. Since there are only k
different remainders possible, we cannot have distinct remainders for each of
k + 1 nodes. Thus, no set of k + 1 nodes can be a clique in this graph.
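
The remainder graph from this example is small enough to check by brute force. The sketch below picks n = 12 and k = 3 purely as illustrative values and tests every set of k + 1 nodes; the function names are not from the text.

```python
from itertools import combinations

def build_graph(n, k):
    """Edge between i and j unless i and j leave the same remainder mod k."""
    return {(i, j) for i, j in combinations(range(1, n + 1), 2)
            if i % k != j % k}

def is_clique(nodes, edges):
    """True if every pair of the given nodes is joined by an edge."""
    return all((a, b) in edges for a, b in combinations(sorted(nodes), 2))

n, k = 12, 3
edges = build_graph(n, k)
possible = n * (n - 1) // 2
fraction = len(edges) / possible  # close to (k - 1) / k for large n

has_k_clique = is_clique(range(1, k + 1), edges)
has_larger_clique = any(is_clique(c, edges)
                        for c in combinations(range(1, n + 1), k + 1))
print(round(fraction, 3), has_k_clique, has_larger_clique)
```

Although nearly three quarters of all possible edges are present, the largest clique still has only k = 3 nodes, as the pigeonhole argument predicts.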


Complete Bipartite Graphs — A complete bipartite graph consists of s
nodes on one side and t nodes on the other side, with all st possible edges
between the nodes of one side and the other present. We denote this graph by
K_{s,t}. An analogy can be drawn between complete bipartite graphs as
subgraphs of general bipartite graphs and cliques as subgraphs of general
graphs.

While it is not possible to guarantee that a graph with many edges has a
large clique, it is possible to guarantee that a bipartite graph with many
edges has a large complete bipartite subgraph. We can regard a complete
bipartite subgraph (or a clique if we discovered a large one) as the nucleus
of a community and add to it nodes with many edges to existing members of the
community. If the graph itself is k-partite then we can take nodes of two
types and the edges between them to form a bipartite graph. In this bipartite
graph, we can search for complete bipartite subgraphs as the nuclei of
communities.

However, we can also use complete bipartite subgraphs for community
finding in ordinary graphs where nodes all have the same type. Divide the
nodes into two equal groups at random. If a community exists, then we would
expect about half its nodes to fall into each group, and we would expect that
about half its edges would go between groups. Thus, we still have a reasonable
chance of identifying a large complete bipartite subgraph in the community.
To this nucleus we can add nodes from either of the two groups, if they have
edges to many of the nodes already identified as belonging to the community.

Q.12. How to use betweenness to find communities ? Explain with
example.

Ans. The betweenness scores for the edges of a graph behave something
like a distance measure on the nodes of the graph. It is not exactly a distance
measure, because it is not defined for pairs of nodes that are unconnected by
an edge, and might not satisfy the triangle inequality even when defined.
However, we can cluster by taking the edges in order of increasing betweenness
and adding them to the graph one at a time. At each step, the connected
components of the graph form some clusters. The higher the betweenness we
allow, the more edges we get, and the larger the clusters become. Equivalently,
we can start with the full graph and remove edges in order of decreasing
betweenness.

For example — Let us consider the graph of fig. 5.5, which shows the
betweenness of each edge. In the calculation, observe that
between N and O there are two shortest
paths, one going through M and the other through P. Thus, each of the edges
(M, N), (N, P), (M, O), and (O, P) are credited with half a shortest path.

Clearly, edge (Y, M) has the highest betweenness, so it is removed first. That
leaves us with exactly the communities we observed make the most sense,
namely {X, Y, Z} and {M, N, P, O}. However, we can continue to remove
edges. Next to leave are (X, Y) and (Y, Z) with a score of 6, followed by
(M, N) and (M, O) with a score of 5.5. Then, (M, P), whose score is 5, would
leave the graph. The remaining graph is shown in fig. 5.6.

Fig. 5.6 All the Edges with Betweenness 5 or More have been Removed

The "communities" of fig. 5.6 look strange. One implication is that X and
Z are more closely knit to each other than to Y. That is, in some sense Y is a
"traitor" to the community {X, Y, Z} because he has a friend M outside that
community. Likewise, M can be seen as a "traitor" to the group {M, N, P, O},
which is why in fig. 5.6, only N, P, and O remain connected.
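
The first removal step can be sketched in code. The edge list below is an assumed reconstruction from the description (a triangle X, Y, Z joined through the edge (Y, M) to M, N, O, P, with N and O both adjacent to M and P), because the figure itself is not reproduced here. The absolute scores it produces depend on that assumed layout and need not match the values quoted for the figure; the sketch only confirms which edge scores highest and which components remain after it is removed.

```python
from collections import deque, defaultdict

# Assumed edge list; treat this layout as a guess at the figure.
EDGES = [("X", "Y"), ("X", "Z"), ("Y", "Z"), ("Y", "M"),
         ("M", "N"), ("M", "O"), ("M", "P"), ("N", "P"), ("O", "P")]

def adjacency(edges):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def edge_betweenness(edges):
    """Shortest paths through each edge, splitting credit between tied paths."""
    adj = adjacency(edges)
    score = defaultdict(float)
    for root in list(adj):
        # BFS from root, counting shortest paths (sigma) to every node.
        dist, sigma, order = {root: 0}, defaultdict(float), []
        sigma[root] = 1.0
        queue = deque([root])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
        # Walk back from the farthest nodes, sharing credit among parents.
        credit = defaultdict(float)
        for v in reversed(order):
            for u in adj[v]:
                if dist[u] == dist[v] - 1:
                    share = (1.0 + credit[v]) * sigma[u] / sigma[v]
                    score[frozenset((u, v))] += share
                    credit[u] += share
    # Every pair of endpoints was counted once from each end.
    return {edge: s / 2 for edge, s in score.items()}

def components(edges, nodes):
    """Connected components of the graph formed by `edges` on `nodes`."""
    adj, seen, result = adjacency(edges), set(), []
    for start in nodes:
        if start in seen:
            continue
        group, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in group:
                    group.add(v)
                    queue.append(v)
        seen |= group
        result.append(group)
    return result

scores = edge_betweenness(EDGES)
top = max(scores, key=scores.get)
remaining = [e for e in EDGES if frozenset(e) != top]
nodes = sorted({n for e in EDGES for n in e})
communities = components(remaining, nodes)
print(sorted(top), [sorted(c) for c in communities])
```

Repeating the last four statements on `remaining` would continue the edge-removal process until the desired number of components is reached.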


Q.13. How to find complete bipartite subgraphs ? Explain.

Ans. Suppose we are given a large bipartite graph G, and we want to find
instances of K_{s,t} within it. It is possible to view the problem of finding
instances of K_{s,t} within G as one of finding frequent itemsets. For this
purpose, let the "items" be the nodes on one side of G, which we shall call
the left side. We assume that the instance of K_{s,t} we are looking for has t
nodes on the left side, and we shall also assume for efficiency that t ≤ s.
The “baskets” correspond to the nodes on the other side of G (the right side).
The members of the basket for node v are the nodes of the left side to which v
is connected. Finally, let the support threshold be s, the number of nodes
that the instance of K_{s,t} has on the right side.

We can now state the problem of finding instances of K_{s,t} as that of
finding frequent itemsets F of size t. That is, if a set of t nodes on the
left side is frequent, then they all occur together in at least s baskets. But
the baskets are the nodes on the right side. Each basket corresponds to a node
that is connected to all t of the nodes in F. Thus, the frequent itemset of
size t and s of the baskets in which all those items appear form an instance
of K_{s,t}.

It is important to understand that we do not mean a generated subgraph
formed by selecting some nodes and including all edges. In this context,
we only require that there be edges between any pair of nodes on different
sides. It is also possible that some nodes on the same side are connected by
edges as well.

For example — Let us consider the bipartite graph of fig. 5.7. The left
side is the nodes {1, 2, 3, 4} and the right side is {m, n, o, p}. The latter
are the baskets, so basket m consists of “items” 1 and 4; that is,
m = {1, 4}. Similarly, n = {2, 3}, o = {1} and p = {3}.

Fig. 5.7 The Bipartite Graph

If s = 2 and t = 1, we must find itemsets of size 1 that appear in at least
two baskets. {1} is one such itemset, and {3} is another. However, in this
tiny example there are no itemsets for larger, more interesting values of s
and t, such as s = t = 2.
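
The reduction to frequent itemsets can be tried directly on the four baskets of this example. The sketch below enumerates every candidate itemset of size t by brute force, which is fine for a toy graph but is not the A-Priori style pruning a real frequent-itemset miner would use; the function name is illustrative.

```python
from itertools import combinations

# Baskets from the example: right-side node -> left-side nodes it connects to.
baskets = {"m": {1, 4}, "n": {2, 3}, "o": {1}, "p": {3}}

def find_instances(baskets, s, t):
    """Return (itemset, supporting baskets) pairs with support of at least s."""
    items = sorted(set().union(*baskets.values()))
    found = []
    for itemset in combinations(items, t):
        support = sorted(b for b, members in baskets.items()
                         if set(itemset) <= members)
        if len(support) >= s:
            found.append((itemset, support))
    return found

print(find_instances(baskets, s=2, t=1))
print(find_instances(baskets, s=2, t=2))
```

With s = 2 and t = 1 the sketch reports {1} (supported by m and o) and {3} (supported by n and p); with s = t = 2 it finds nothing, in agreement with the example.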


Q.14. Discuss about the community or group detection.


Ans. Community or group detection in online social networks is based on
studying the social network structure to find individual nodes in the network
that correlate more with each other than with other related groups of users.
Discovery of such communities leads to intra-community pairs (users or nodes
belonging to the same community) that are more likely to be connected or
associated than inter-community pairs (users or nodes belonging to
different communities). Such clustering into groups helps to make further
estimates about a user in the network, regarding his/her likes and interests,
tastes and future activities. This, in turn, will help in assessing the
probability of which products he/she would buy, which songs or movies he/she
would watch, which services he/she may be interested in, and so on.


[Figure: community detection methods divided into four categories]

* Traditional Clustering Methods: Hierarchical, Spectral, Partitional, etc.
* Link-based Methods
* Topic-based Methods
* Link + Topic Based Methods: CART, Topic Link LDA, TUCM, TURCM, etc.

Fig. 5.8 Various Categories of Community Detection Methods


Detecting communities is not only restricted to social networks but also
is of great significance in various other fields such as politics, economics,
biology, sociology and information networks where systems can be represented
as graphs. However, detection of communities in vast, dynamic and complex
networks is a challenging problem and has caught the attention of many
researchers working in this area of SNAM.

Fig. 5.8 shows that there are several approaches that can be followed for
community detection in social networks, namely the traditional clustering
methods, the link-based methods, the topic-based methods and the topic-link
based methods. A brief description of these community detection methods is as
follows —

(i) Traditional Clustering Methods — The hierarchical
clustering constructs a group of nested clusters by progressively merging or
splitting clusters.

(ii) Link-based Methods — The link-based community detection
methods study the links or edges of the network to detect communities. Two
common link-based community detection methods are Hyperlink Induced Topic
Search (HITS) and Maximum Flow Community (MFC).

(iii) Topic-based Methods — The topic-based community detection
methods generate communities which are topically similar. These methods,
however, do not concentrate on nodes that may share some explicit
communication via links. Two common topic-based community detection
methods are Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet
Allocation (LDA).

(iv) Topic-link Based Methods — The topic-link based community
detection approach is the most recent approach used for detecting communities
in social networks. This hybrid approach is a combination of link-based methods
and topic-based methods that addresses the disadvantages of using only one
single method for community detection. Two common topic-link based community
detection methods are Community-Author-Recipient-Topic (CART) and Topic-
Link LDA.


Q.15. What do you understand by graph partitioning ?


Ans. Graphs are used by various scientists and researchers to model
application problems. Partitioning a graph helps to reduce the complexity of
big graphs and also introduces parallelization.

Graph partitioning is required in various application problems like social
networks, road networks, air traffic control, image analysis etc. Some
applications of graph partitioning problems are scientific computing,
partitioning various stages of VLSI design circuit, task scheduling in
multiprocessor systems etc. The aim of graph partitioning is to divide the
nodes into several disjoint parts such that the predefined objective function
is minimal. Optimal graph partitioning is NP-complete; however, various
approximate algorithms have been developed to solve the problem.

Graph partitioning is divided into two groups —

(i) Constrained Partitioning — In this partitioning the parts are of
equal size.

(ii) Unconstrained Partitioning — In this partitioning the parts are
of different size.

Various algorithms are available for graph partitioning. Among them three
principal algorithms are Geometric partitioning, Spectral partitioning and
Multi-level graph partitioning.

In geometric partitioning the graph is bisected by utilizing the coordinates
that are available when the nodes of a graph are located in space. In this
partition the vertices which are spatially near to each other are taken into
one cluster. In spectral partitioning the connectivity of the graph is
determined by finding the eigenvector corresponding to the second smallest
eigenvalue of the Laplacian matrix L of the graph. This bisection method is
computationally demanding and therefore not feasible for large graphs.
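
Spectral bisection can be sketched without any numerical library. The version below is an illustration under stated assumptions rather than a production method: it approximates the eigenvector of the second smallest eigenvalue of L = D − A by power iteration on (c·I − L) with the constant vector projected out, where c is twice the maximum degree, and then splits the nodes by sign. The six-node test graph (two triangles joined by one edge) and the function name are hypothetical.

```python
def fiedler_partition(n, edges, iterations=500):
    """Split nodes 0..n-1 by the sign of an approximate Fiedler vector."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    shift = 2 * max(len(a) for a in adj)  # upper bound on eigenvalues of L

    def centre(x):
        # Remove the component along the constant vector (eigenvalue 0 of L).
        mean = sum(x) / n
        return [xi - mean for xi in x]

    x = centre([float(i) for i in range(n)])  # arbitrary non-constant start
    for _ in range(iterations):
        # y = (shift * I - L) x, with (L x)[i] = deg(i) * x[i] - sum over neighbours.
        y = [shift * x[i] - (len(adj[i]) * x[i] - sum(x[j] for j in adj[i]))
             for i in range(n)]
        y = centre(y)
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    part_a = {i for i in range(n) if x[i] < 0}
    return part_a, set(range(n)) - part_a

# Hypothetical graph: triangle 0-1-2 and triangle 3-4-5 joined by edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
part_a, part_b = fiedler_partition(6, edges)
print(sorted(part_a), sorted(part_b))
```

The split separates the two triangles and cuts only the single bridging edge. For large graphs a sparse eigensolver would normally replace the simple iteration used here.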

Multi-level partitioning is more effective than the classical graph
partitioning methods. Constrained graph partitioning problems are efficiently
solved by this method. The main idea is to partition the large graph into k
parts by grouping vertices together and dealing with each group of vertices
rather than with independent vertices. It has three phases — coarsening phase,
initial partitioning phase, and partition refinement phase.


0.16. Explain the term structural analysis. 


Ans. Graph isa mathematical structure which shows relation between some 
objects. Graph is made up of vertices, nodes, or points which are connected 
using lines, arcs or edges. A graph тау be directed or undirected. A graph 
represented as G — (V, E) where V is set of vertices and E is sct of edges. Graph 
is represented using different data structures like adjacency list, adjacency 
matrices, ence matrices etc. In a social network the node represent individuals 
or organizations and the edges represent the relationship between individuals or 
organizations, Social network Provides set of methods for analyzing the structure 
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. Е - rrerent levels of analysis that ар 
of whole social entities. There are different E c Ed t ate Not 
necessarily mutually exclusive. Generally there are thre ‘th T 
(i) Micro Level: Analysis at this level typically begins with an individual, or
may begin with a small group of individuals.
(ii) Meso Level: It falls between the micro and macro levels and shows
the connection between them. Meso-level networks are low-density networks.
(iii) Macro Level: These are large-scale networks that trace the
interactions over a large population. These networks are more complex.
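
Two of the graph representations named in this answer, the adjacency list and the adjacency matrix, can be sketched as follows, together with a density measure for an undirected graph. The helper names are hypothetical, and the density formula 2|E| / (|V|(|V| - 1)) is the usual definition for simple undirected graphs, assumed here since the text does not define it.

```python
def adjacency_list(n, edges):
    """Map each vertex 0..n-1 to the list of its neighbours."""
    adj = {v: [] for v in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj


def adjacency_matrix(n, edges):
    """n x n matrix with 1 where an edge exists and 0 elsewhere."""
    matrix = [[0] * n for _ in range(n)]
    for u, v in edges:
        matrix[u][v] = 1
        matrix[v][u] = 1
    return matrix


def density(n, edges):
    """Fraction of possible undirected edges that are present (n >= 2)."""
    return 2 * len(edges) / (n * (n - 1))


# A small network: four individuals connected in a chain.
edges = [(0, 1), (1, 2), (2, 3)]
adj = adjacency_list(4, edges)
matrix = adjacency_matrix(4, edges)
d = density(4, edges)
```

A density close to 0 indicates a sparse network, while a density of 1 means that every pair of nodes is connected.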


Q.17. What is a recommender system ?

Ans. Recommender Systems (RS) provide recommendations to users about
a set of articles or services they might be interested in. This facility in online
social networks (OSNs) has become very popular due to the easy access to
information on the internet. An important application of RS is their use on
several e-commerce sites, such as Amazon, Flipkart and Firstcry, for
recommending items such as movies, books, gadgets, and jewellery. The
data required for providing recommendations can be obtained explicitly, based
on users' ratings or comments, or implicitly, by monitoring user activities such
as items checked, books bought, audio listened to, web sites visited, and so on.
RS may also use demographic information about users, like occupation, gender
or age, for clustering groups of users that may have similar likings. Fig. 5.9
shows a generic RS framework that takes as input the user profile, the item
profile and/or the user-item rating matrix. These inputs are fed to a recommender
system engine, which computes and predicts the top-N recommendations for a user.
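
To make the top-N step concrete, here is a minimal sketch of one possible recommender engine. It assumes user-based collaborative filtering with cosine similarity over a user-item rating matrix, which is only one of several techniques and is not prescribed by the framework described above. All names and ratings are made up for illustration.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two users' rating dictionaries."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def top_n(ratings, user, n=2):
    """Return the n unseen items with the highest predicted rating."""
    scores, weights = {}, {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        if sim <= 0:
            continue
        for item, rating in their.items():
            if item not in ratings[user]:
                # Similarity-weighted sum of the other users' ratings.
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: scores[item] / weights[item] for item in scores}
    # Highest predicted rating first; ties broken alphabetically.
    return sorted(predicted, key=lambda item: (-predicted[item], item))[:n]


# Hypothetical user-item rating matrix (ratings on a 1 to 5 scale).
ratings = {
    "alice": {"book": 5, "movie": 4},
    "bob": {"book": 5, "movie": 4, "gadget": 5, "jewellery": 1},
    "carol": {"book": 1, "gadget": 2, "jewellery": 5},
}
recommendations = top_n(ratings, "alice", n=2)
```

Here the ratings play the role of explicitly obtained data. Implicit signals such as visits or purchases could be mapped to similar scores before being fed to the engine.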


It is a challenge for every e-commerce site to match consumers with the
most suitable products, which in turn enhances customer satisfaction and loyalty.
Hence, the majority of e-commerce sites have become interested in RS,
which provide personalized recommendations to each user visiting a site by
examining patterns of user interest in products that suggest the user's taste.
Fig. 5.9 The Generic Recommender System Framework